# Merging DataFrames with Pandas

#### Preparing data
Reading multiple data files, reindexing DataFrames, arithmetic with Series and DataFrames

#### Concatenating data
Appending with .append(), pd.concat(), .reset_index(), ignore_index, keys and MultiIndexes, outer and inner joins

#### Merging data
merge(), merging all columns (merge()), merging on multiple columns, using suffixes, specifying columns to merge, merge left, merge right, merging with inner join (how='inner'), merging with left join (how='left'), using join(), ordered merges, using merge(how='outer'), sorting merge, merge_ordered(fill_method='ffill')



### What to use?
df1.append(df2) - stacking vertically (removed in pandas 2.0; use pd.concat([df1, df2]) instead)
pd.concat([df1, df2]) - stacking horizontally or vertically, simple inner/outer join on Indexes
df1.join(df2) - inner/outer/left/right join on Indexes
pd.merge(df1, df2) - many kinds of joins on multiple columns
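
As a rough illustration of how the three current tools differ, the sketch below uses two made-up frames. The names `left` and `right`, their columns, and their values are hypothetical and not part of the course data.

```python
import pandas as pd

# Hypothetical toy frames, not part of the course data
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'y': [20, 30, 40]})

# pd.concat: stack vertically; columns missing from one frame become NaN
stacked = pd.concat([left, right], ignore_index=True)

# .join: combines on the Index, so move 'key' into the index first
joined = left.set_index('key').join(right.set_index('key'), how='inner')

# pd.merge: combines on one or more columns
merged = pd.merge(left, right, on='key', how='outer')

print(stacked.shape)   # (6, 3)
print(joined)
print(merged)
```

The inner join keeps only the keys `b` and `c`, while the outer merge keeps all four keys and fills the gaps with NaN.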



More info on pivoting:

https://www.dataquest.io/blog/pandas-pivot-table/
https://www.kaggle.com/ostrowski/olympic-games-pivot-and-groupby-beginners-guide


#### Tools for pandas data import
pd.read_csv() for CSV files
dataframe = pd.read_csv(filepath)  + dozens of optional input parameters

Other data import tools:
pd.read_excel(), pd.read_html(), pd.read_json()


#### Loading separate files


In [None]:
##### Loading separate files
import pandas as pd
dataframe0 = pd.read_csv('sales-jan-2015.csv')
dataframe1 = pd.read_csv('sales-feb-2015.csv')


#instead, load the 2 files using a loop
filenames = ['sales-jan-2015.csv','sales-feb-2015.csv']
dataframes = []
for f in filenames:
    dataframes.append(pd.read_csv(f))

#another option: loading with a list comprehension
filenames = ['sales-jan-2015.csv','sales-feb-2015.csv']
dataframes = [pd.read_csv(f) for f in filenames]


#using glob
from glob import glob
filenames = glob('sales*.csv')
dataframes = [pd.read_csv(f) for f in filenames]
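
The glob pattern can be tried without the course files. The sketch below first writes two tiny CSV files into a temporary directory, then loads them with the same pattern. The file contents are invented for illustration.

```python
import os
import tempfile
from glob import glob

import pandas as pd

# Write two tiny hypothetical sales files into a temporary directory
tmpdir = tempfile.mkdtemp()
for month, units in [('jan', 3), ('feb', 5)]:
    path = os.path.join(tmpdir, f'sales-{month}-2015.csv')
    pd.DataFrame({'Units': [units]}).to_csv(path, index=False)

# glob returns matches in arbitrary order, so sort for reproducibility
filenames = sorted(glob(os.path.join(tmpdir, 'sales*.csv')))
dataframes = [pd.read_csv(f) for f in filenames]

print([os.path.basename(f) for f in filenames])
print(len(dataframes))   # 2
```

Sorting is alphabetical, so the February file comes before the January one. A chronological order would need an explicit list of filenames.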

### Reindexing DataFrames



In [None]:
import pandas as pd
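w_mean = pd.read_csv('quarterly_mean.csv', index_col='Month')
# NOTE: this reads the same file as w_mean; a separate max-temperature file was presumably intended
w_max = pd.read_csv('quarterly_mean.csv', index_col='Month')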

In [None]:
#check indexes
print(w_mean.index)

print(type(w_mean.index))


In [None]:
#using .reindex()
ordered = ['Jan', 'Apr', 'Jul', 'Oct']
w_mean2 = w_mean.reindex(ordered)
print(w_mean2)

In [None]:
#using .sort_index()
w_mean.sort_index()

In [None]:
#reindex from a dataframe index
w_mean.reindex(w_max.index)

In [None]:
#reindexing with missing labels
w_mean3 = w_mean.reindex(['Jan', 'Apr', 'Jul'])

In [None]:
#reindex from dataframe index
w_max.reindex(w_mean3.index)

w_max.reindex(w_mean3.index).dropna()

In [None]:
#order matters: the result follows the order of the index passed in
w_max.reindex(w_mean.index)
w_mean.reindex(w_max.index)
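
A minimal sketch of what `.reindex()` does with labels that are missing from the original index. The temperature values are invented, and `'Dec'` is deliberately absent from the data.

```python
import pandas as pd

# Hypothetical quarterly means; values are made up for illustration
w_mean = pd.DataFrame({'Mean TemperatureF': [61.9, 32.1, 68.9, 43.4]},
                      index=pd.Index(['Apr', 'Jan', 'Jul', 'Oct'], name='Month'))

# Labels are matched by name; 'Dec' is absent, so its row is filled with NaN
partial = w_mean.reindex(['Jan', 'Apr', 'Dec'])
print(partial)

# The result follows the order of the labels passed in, not the original order
print(list(partial.index))        # ['Jan', 'Apr', 'Dec']
print(partial.dropna().shape)     # (2, 1)
```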

### Arithmetic with Series and DataFrames

In [2]:
#loading weather data
import pandas as pd
weather = pd.read_csv('weather_data_austin_2010.csv',
                     index_col='Date', parse_dates = True)
weather.head(3)

                     Temperature  DewPoint  Pressure
Date
2010-01-01 00:00:00         46.2      37.5         1
2010-01-01 01:00:00         44.6      37.1         1
2010-01-01 02:00:00         44.1      36.9         1


In [3]:
weather.loc['2010-01-01 00:00:00':'2010-01-01 02:00:00', 'Temperature']

Date
2010-01-01 00:00:00    46.2
2010-01-01 01:00:00    44.6
2010-01-01 02:00:00    44.1
Name: Temperature, dtype: float64

In [4]:
#scalar multiplication 
weather.loc['2010-01-01 00:00:00':'2010-01-01 02:00:00', 'Temperature'] *2.54

Date
2010-01-01 00:00:00    117.348
2010-01-01 01:00:00    113.284
2010-01-01 02:00:00    112.014
Name: Temperature, dtype: float64

In [9]:
#absolute temperature range
week1_range = weather.loc['2010-01-01 00:00:00':'2010-01-07 00:00:00',['Temperature']]


In [10]:
#percent changes
week1_range.pct_change() *100

                     Temperature
Date
2010-01-01 00:00:00          NaN
2010-01-01 01:00:00    -3.463203
2010-01-01 02:00:00    -1.121076
2010-01-01 03:00:00    -0.680272
2010-01-01 04:00:00    -0.684932
2010-01-01 05:00:00    -1.149425
2010-01-01 06:00:00     0.232558
2010-01-01 07:00:00    -1.856148
2010-01-01 08:00:00     0.472813
2010-01-01 09:00:00     8.000000
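
To see what `.pct_change()` computes, the sketch below repeats the calculation by hand on the first three hourly temperatures printed earlier (46.2, 44.6, 44.1). Each value is divided by the previous one, so the first entry has nothing to compare against and is NaN.

```python
import pandas as pd

# First three hourly temperatures shown in the output above
temps = pd.Series([46.2, 44.6, 44.1])

# pct_change() compares each value with the previous one
by_method = temps.pct_change() * 100

# The same calculation written out with shift()
by_hand = (temps / temps.shift(1) - 1) * 100

print(by_method.round(6).tolist())
print(by_hand.round(6).tolist())
```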


In [11]:
#bronze olympic medals
bronze = pd.read_csv('bronze_top5.csv', index_col=0)
print(bronze)

                Total
Country              
United States    1052
Soviet Union      584
United Kingdom    505
France            475
Germany           454


In [12]:
#gold olympic medals
gold = pd.read_csv('gold_top5.csv', index_col=0)
print(gold)

                Total
Country              
United States    2088
Soviet Union      838
United Kingdom    498
Italy             460
Germany           407


In [13]:
#silver olympic medals
silver = pd.read_csv('silver_top5.csv', index_col=0)
print(silver)

                Total
Country              
United States    1195
Soviet Union      627
United Kingdom    591
France            461
Italy             394


In [14]:
#adding bronze and silver
bronze + silver

                 Total
Country
France           936.0
Germany            NaN
Italy              NaN
Soviet Union    1211.0
United Kingdom  1096.0
United States   2247.0


In [18]:
bronze.add(silver)

                 Total
Country
France           936.0
Germany            NaN
Italy              NaN
Soviet Union    1211.0
United Kingdom  1096.0
United States   2247.0


In [19]:
#using a fill_value
bronze.add(silver, fill_value = 0)

                 Total
Country
France           936.0
Germany          454.0
Italy            394.0
Soviet Union    1211.0
United Kingdom  1096.0
United States   2247.0


In [20]:
bronze+silver+gold

                 Total
Country
France             NaN
Germany            NaN
Italy              NaN
Soviet Union    2049.0
United Kingdom  1594.0
United States   4335.0


In [21]:
#chaining .add()
bronze.add(silver, fill_value=0).add(gold, fill_value=0)

                 Total
Country
France           936.0
Germany          861.0
Italy            854.0
Soviet Union    2049.0
United Kingdom  1594.0
United States   4335.0
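
The alignment behaviour can be reproduced without the CSV files. This sketch rebuilds the bronze and silver totals from the tables printed earlier as plain Series, then compares `+` with `.add(fill_value=0)`.

```python
import pandas as pd

# Totals copied from the bronze and silver tables printed above
bronze = pd.Series({'United States': 1052, 'Soviet Union': 584,
                    'United Kingdom': 505, 'France': 475, 'Germany': 454})
silver = pd.Series({'United States': 1195, 'Soviet Union': 627,
                    'United Kingdom': 591, 'France': 461, 'Italy': 394})

plain = bronze + silver                      # NaN where a label is missing on either side
filled = bronze.add(silver, fill_value=0)    # the missing side is treated as 0

print(plain.isna().sum())                    # 2 (Germany and Italy)
print(filled['Germany'], filled['Italy'])    # 454.0 394.0
```

Labels are matched by name rather than by position, which is why Germany and Italy, each present in only one Series, end up as NaN under plain addition.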


### Exercises

In [22]:
#reading dataframes from multiple files in a loop
# Import pandas
import pandas as pd

# Create the list of file names: filenames
filenames = ['gold.csv', 'silver.csv', 'bronze.csv']

# Create the list of three DataFrames: dataframes
dataframes = []
for filename in filenames:
    dataframes.append(pd.read_csv(filename))

# Print top 5 rows of 1st DataFrame in dataframes
print(dataframes[0].head())

   NOC         Country   Total
0  USA   United States  2088.0
1  URS    Soviet Union   838.0
2  GBR  United Kingdom   498.0
3  FRA          France   378.0
4  GER         Germany   407.0


In [23]:
print(dataframes[1].head())

   NOC         Country   Total
0  USA   United States  1195.0
1  URS    Soviet Union   627.0
2  GBR  United Kingdom   591.0
3  FRA          France   461.0
4  GER         Germany   350.0


In [24]:
'''
Combining DataFrames from multiple data files
In this exercise, you'll combine the three DataFrames from earlier exercises - gold, 
silver, & bronze - into a single DataFrame called medals. The approach you'll use here is
clumsy. Later on in the course, you'll see various powerful methods that are frequently used
in practice for concatenating or merging DataFrames.
Remember, the column labels of each DataFrame are NOC, Country, and Total, where NOC
is a three-letter code for the name of the country and Total is the number of medals
of that type won.
INSTRUCTIONS
Construct a copy of the DataFrame gold called medals using the .copy() method.
Create a list called new_labels with entries 'NOC', 'Country', & 'Gold'. This is the same
as the column labels from gold with the column label 'Total' replaced by 'Gold'.
Rename the columns of medals by assigning new_labels to medals.columns.
Create new columns 'Silver' and 'Bronze' in medals using silver['Total'] & bronze['Total'].
Print the top 5 rows of the final DataFrame medals.
'''
# Import pandas
import pandas as pd

gold = pd.read_csv('gold.csv')
silver = pd.read_csv('silver.csv')
bronze = pd.read_csv('bronze.csv')
# Make a copy of gold: medals
medals = gold.copy()

# Create list of new column labels: new_labels
new_labels = ['NOC', 'Country', 'Gold']

# Rename the columns of medals using new_labels
medals.columns = new_labels

# Add columns 'Silver' & 'Bronze' to medals
medals['Silver'] = silver['Total']
medals['Bronze'] = bronze['Total']

# Print the head of medals
print(medals.head())

   NOC         Country    Gold  Silver  Bronze
0  USA   United States  2088.0  1195.0  1052.0
1  URS    Soviet Union   838.0   627.0   584.0
2  GBR  United Kingdom   498.0   591.0   505.0
3  FRA          France   378.0   461.0   475.0
4  GER         Germany   407.0   350.0   454.0


In [None]:
'''
Sorting DataFrame with the Index & columns
It is often useful to rearrange the sequence of the rows of a DataFrame by sorting.
You don't have to implement these yourself; the principal methods for doing this are
.sort_index() and .sort_values().
In this exercise, you'll use these methods with a DataFrame of temperature values
indexed by month names. You'll sort the rows alphabetically using the Index and 
numerically using a column. Notice, for this data, the original ordering is probably 
most useful and intuitive: the purpose here is for you to understand what the sorting 
methods do.

Read 'monthly_max_temp.csv' into a DataFrame called weather1 with 'Month' as the index.
Sort the index of weather1 in alphabetical order using the .sort_index() method and store 
the result in weather2.
Sort the index of weather1 in reverse alphabetical order by specifying the additional keyword 
argument ascending=False inside .sort_index().
Use the .sort_values() method to sort weather1 in increasing numerical order according
to the values of the column 'Max TemperatureF'.
'''
# Import pandas
import pandas as pd

# Read 'monthly_max_temp.csv' into a DataFrame: weather1
weather1 = pd.read_csv('monthly_max_temp.csv', index_col='Month')

# Print the head of weather1
print(weather1.head())

# Sort the index of weather1 in alphabetical order: weather2
weather2 = weather1.sort_index()

# Print the head of weather2
print(weather2.head())

# Sort the index of weather1 in reverse alphabetical order: weather3
weather3 = weather1.sort_index(ascending=False)

# Print the head of weather3
print(weather3.head())

# Sort weather1 numerically using the values of 'Max TemperatureF': weather4
weather4 = weather1.sort_values('Max TemperatureF')

# Print the head of weather4
print(weather4.head())
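
Since 'monthly_max_temp.csv' is not included here, the following sketch shows the same three sorts on a small invented frame. The month labels and temperatures are hypothetical.

```python
import pandas as pd

# Hypothetical monthly maxima; values are invented for illustration
weather1 = pd.DataFrame({'Max TemperatureF': [68, 60, 89]},
                        index=pd.Index(['Jan', 'Feb', 'Jul'], name='Month'))

weather2 = weather1.sort_index()                      # alphabetical: Feb, Jan, Jul
weather3 = weather1.sort_index(ascending=False)       # reverse: Jul, Jan, Feb
weather4 = weather1.sort_values('Max TemperatureF')   # by value: 60, 68, 89

print(list(weather2.index))
print(list(weather3.index))
print(list(weather4['Max TemperatureF']))
```

All three methods return a new DataFrame and leave `weather1` unchanged.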

In [None]:
'''
Reindexing DataFrame from a list
Sorting methods are not the only way to change DataFrame Indexes. There is also the
.reindex() method.
In this exercise, you'll reindex a DataFrame of quarterly-sampled mean temperature values
to contain monthly samples (this is an example of upsampling or increasing the rate of samples,
which you may recall from the pandas Foundations course).
The original data has the first month's abbreviation of the quarter (three-month interval)
on the Index, namely Apr, Jan, Jul, and Oct. This data has been loaded into a DataFrame
called weather1 and has been printed in its entirety in the IPython Shell. Notice it has
only four rows (corresponding to the first month of each quarter) and that the rows are
not sorted chronologically.
You'll initially use a list of all twelve month abbreviations and subsequently apply
the .ffill() method to forward-fill the null entries when upsampling. This list of 
month abbreviations has been pre-loaded as year.
INSTRUCTIONS
Reorder the rows of weather1 using the .reindex() method with the list year as the argument,
which contains the abbreviations for each month.
Reorder the rows of weather1 just as you did above, this time chaining the .ffill() 
method to replace the null values with the last preceding non-null value.
'''
# Import pandas
import pandas as pd

# Reindex weather1 using the list year: weather2
weather2 = weather1.reindex(year)

# Print weather2
print(weather2)

# Reindex weather1 using the list year with forward-fill: weather3
weather3 = weather1.reindex(year).ffill()

# Print weather3
print(weather3)
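
Because `weather1` and `year` are pre-loaded in the exercise, the cell above does not run on its own. The sketch below is a self-contained version under the assumption that the quarterly frame looks roughly like this. The temperatures are invented.

```python
import pandas as pd

# Hypothetical quarterly data; the values are invented for illustration
weather1 = pd.DataFrame({'Mean TemperatureF': [61.9, 32.1, 68.9, 43.4]},
                        index=['Apr', 'Jan', 'Jul', 'Oct'])
year = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
        'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

weather2 = weather1.reindex(year)           # 8 of the 12 rows are NaN
weather3 = weather1.reindex(year).ffill()   # each gap takes the previous value

print(weather2['Mean TemperatureF'].isna().sum())   # 8
print(weather3.loc['Mar', 'Mean TemperatureF'])     # 32.1, carried forward from Jan
```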

In [32]:
'''
Reindexing using another DataFrame Index
Another common technique is to reindex a DataFrame using the Index of another DataFrame.
The DataFrame .reindex() method can accept the Index of a DataFrame or Series as input.
You can access the Index of a DataFrame with its .index attribute.
The Baby Names Dataset from data.gov summarizes counts of names (with genders) from births 
registered in the US since 1881. In this exercise, you will start with two baby-names 
DataFrames names_1981 and names_1881 loaded for you.
The DataFrames names_1981 and names_1881 both have a MultiIndex with levels name and gender
giving unique labels to counts in each row. If you're interested in seeing how the 
MultiIndexes were set up, names_1981 and names_1881 were read in using the following commands:
names_1981 = pd.read_csv('names1981.csv', header=None, names=['name','gender','count'],
index_col=(0,1))
names_1881 = pd.read_csv('names1881.csv', header=None, names=['name','gender','count'],
index_col=(0,1))
As you can see by looking at their shapes, which have been printed in the IPython Shell,
the DataFrame corresponding to 1981 births is much larger, reflecting the greater diversity
of names in 1981 as compared to 1881.
Your job here is to use the DataFrame .reindex() and .dropna() methods to make a DataFrame
common_names counting names from 1881 that were still popular in 1981.

Create a new DataFrame common_names by reindexing names_1981 using the Index of the DataFrame
names_1881 of older names.
Print the shape of the new common_names DataFrame. This has been done for you. It should be
the same as that of names_1881.
Drop the rows of common_names that have null counts using the .dropna() method. These rows 
correspond to names that fell out of fashion between 1881 & 1981.
Print the shape of the reassigned common_names DataFrame.
'''
# Import pandas
import pandas as pd
names_1881=pd.read_csv('names1881.csv',header=None, names=['name','sex','number'],index_col=(0,1))
names_1981=pd.read_csv('names1981.csv',header=None, names=['name','sex','number'],index_col=(0,1))

names_1881.head(3)

          number
name sex
Mary F      6919
Anna F      2698
Emma F      2034


In [33]:
names_1981.head(3)

              number
name     sex
Jennifer F     57032
Jessica  F     42519
Amanda   F     34370


In [34]:
# Reindex names_1981 with index of names_1881: common_names
common_names = names_1981.reindex(names_1881.index)

# Print shape of common_names
print(common_names.shape)

# Drop rows with null counts: common_names
common_names = common_names.dropna()

# Print shape of new common_names
print(common_names.shape)

(1935, 1)
(1587, 1)
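
The same reindex-then-dropna pattern on a tiny MultiIndex example. The two frames are hypothetical stand-ins for `names_1881` and `names_1981`. Only the Mary and Anna counts for 1881 come from the output above; the remaining names and counts are invented.

```python
import pandas as pd

# Tiny hypothetical stand-ins for names_1881 and names_1981
idx_1881 = pd.MultiIndex.from_tuples(
    [('Mary', 'F'), ('Anna', 'F'), ('Minnie', 'F')], names=['name', 'sex'])
idx_1981 = pd.MultiIndex.from_tuples(
    [('Jennifer', 'F'), ('Mary', 'F'), ('Anna', 'F')], names=['name', 'sex'])
names_1881 = pd.DataFrame({'number': [6919, 2698, 1700]}, index=idx_1881)
names_1981 = pd.DataFrame({'number': [57032, 11000, 5000]}, index=idx_1981)

# Keep the 1881 labels, but take the counts from 1981
common_names = names_1981.reindex(names_1881.index)
print(common_names.shape)    # (3, 1), same number of rows as names_1881

# Labels missing from 1981 come back as NaN and are dropped here
common_names = common_names.dropna()
print(common_names.shape)    # (2, 1)
```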


In [None]:
'''
Broadcasting in arithmetic formulas
In this exercise, you'll work with weather data pulled from wunderground.com. 
The DataFrame weather has been pre-loaded along with pandas as pd. It has 365 rows 
(observed each day of the year 2013 in Pittsburgh, PA) and 22 columns reflecting different
weather measurements each day.
You'll subset a collection of columns related to temperature measurements in degrees 
Fahrenheit, convert them to degrees Celsius, and relabel the columns of the new DataFrame
to reflect the change of units.
Remember, ordinary arithmetic operators (like +, -, *, and /) broadcast scalar values
to conforming DataFrames when combining scalars & DataFrames in arithmetic expressions.
Broadcasting also works with pandas Series and NumPy arrays.

Create a new DataFrame temps_f by extracting the columns 'Min TemperatureF', 
'Mean TemperatureF', & 'Max TemperatureF' from weather as a new DataFrame temps_f.
To do this, pass the relevant columns as a list to weather[].
Create a new DataFrame temps_c from temps_f using the formula (temps_f - 32) * 5/9.
Rename the columns of temps_c to replace 'F' with 'C' using the .str.replace('F', 'C') 
method on temps_c.columns.
Print the first 5 rows of DataFrame temps_c.
'''
# Extract selected columns from weather as new DataFrame: temps_f
temps_f = weather[['Min TemperatureF', 'Mean TemperatureF', 'Max TemperatureF']]

# Convert temps_f to celsius: temps_c
temps_c = (temps_f - 32) * 5/9

# Rename 'F' in column names with 'C': temps_c.columns
temps_c.columns = temps_c.columns.str.replace('F', 'C')

# Print first 5 rows of temps_c
print(temps_c.head())
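
The `weather` DataFrame is pre-loaded in the exercise, so here is a self-contained sketch of the same broadcasting step. The Fahrenheit readings are invented and chosen so that the Celsius results are easy to check.

```python
import pandas as pd

# Hypothetical Fahrenheit readings standing in for the pre-loaded weather data
temps_f = pd.DataFrame({'Min TemperatureF': [32.0, 50.0],
                        'Max TemperatureF': [212.0, 68.0]})

# The scalars 32 and 5/9 are broadcast to every cell of the DataFrame
temps_c = (temps_f - 32) * 5 / 9

# Rename the columns to reflect the new units
temps_c.columns = temps_c.columns.str.replace('F', 'C')

print(temps_c)
```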


In [36]:
'''
Computing percentage growth of GDP
Your job in this exercise is to compute the yearly percent-change of US GDP
(Gross Domestic Product) since 2008.
The data has been obtained from the Federal Reserve Bank of St. Louis and is available in
the file GDP.csv, which contains quarterly data; you will resample it to annual sampling and
then compute the annual growth of GDP. For a refresher on resampling, check out the relevant 
material from pandas Foundations.
INSTRUCTIONS
Read the file 'GDP.csv' into a DataFrame called gdp.
Use parse_dates=True and index_col='DATE'.
Create a DataFrame post2008 by slicing gdp such that it comprises all rows from 2008 onward.
Print the last 8 rows of the slice post2008. This has been done for you. This data has 
quarterly frequency so the indices are separated by three-month intervals.

Create the DataFrame yearly by resampling the slice post2008 by year. Remember, you
need to chain .resample() (using the alias 'A' for annual frequency) with some kind of
aggregation; you will use the aggregation method .last() to select the last element when
resampling.
Compute the percentage growth of the resampled DataFrame yearly with .pct_change() * 100.
'''
import pandas as pd

# Read the GDP data (saved locally as 'GDP_USA.csv') into a DataFrame: gdp
gdp = pd.read_csv('GDP_USA.csv', parse_dates=True, index_col='DATE')

gdp.head(3)

            VALUE
DATE
1947-01-01  243.1
1947-04-01  246.3
1947-07-01  250.1


In [37]:
# Slice all the gdp data from 2008 onward: post2008
post2008 = gdp.loc['2008':]

# Print the last 8 rows of post2008
print(post2008.tail(8))

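# Resample post2008 by year, keeping last(): yearly
# ('A' is the annual alias used in these notes; pandas 2.2+ prefers 'YE')
yearly = post2008.resample('A').last()

# Print yearly
print(yearly)

# Compute percentage growth of yearly: yearly['growth']
# (select the 'VALUE' column so that a Series, not a one-column DataFrame, is assigned)
yearly['growth'] = yearly['VALUE'].pct_change() * 100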

# Print yearly again
print(yearly)

              VALUE
DATE               
2014-07-01  17569.4
2014-10-01  17692.2
2015-01-01  17783.6
2015-04-01  17998.3
2015-07-01  18141.9
2015-10-01  18222.8
2016-01-01  18281.6
2016-04-01  18436.5
              VALUE
DATE               
2008-12-31  14549.9
2009-12-31  14566.5
2010-12-31  15230.2
2011-12-31  15785.3
2012-12-31  16297.3
2013-12-31  16999.9
2014-12-31  17692.2
2015-12-31  18222.8
2016-12-31  18436.5
              VALUE    growth
DATE                         
2008-12-31  14549.9       NaN
2009-12-31  14566.5  0.114090
2010-12-31  15230.2  4.556345
2011-12-31  15785.3  3.644732
2012-12-31  16297.3  3.243524
2013-12-31  16999.9  4.311144
2014-12-31  17692.2  4.072377
2015-12-31  18222.8  2.999062
2016-12-31  18436.5  1.172707
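
The yearly growth figure can be checked on a few of the quarterly values printed above. To stay independent of the frequency alias, which differs between pandas versions, this sketch swaps `.resample()` for a `groupby` on the calendar year. That substitution is made only for this illustration and is not the method used in the exercise.

```python
import pandas as pd

# Quarterly values copied from the tail of post2008 printed above
quarterly = pd.Series(
    [17569.4, 17692.2, 17783.6, 17998.3, 18141.9, 18222.8],
    index=pd.to_datetime(['2014-07-01', '2014-10-01', '2015-01-01',
                          '2015-04-01', '2015-07-01', '2015-10-01']))

# Group by calendar year and keep the last quarter of each year
# (a stand-in for resampling with .last(); the index becomes plain years)
yearly = quarterly.groupby(quarterly.index.year).last()
growth = yearly.pct_change() * 100

print(yearly)
print(growth.round(6))
```

The 2015 value matches the 2.999062 shown in the `growth` column above.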


In [38]:
'''
Converting currency of stocks
In this exercise, stock prices in US Dollars for the S&P 500 in 2015 have been obtained
from Yahoo Finance. The files sp500.csv for sp500 and exchange.csv for the exchange rates 
are both provided to you.
Using the daily exchange rate to Pounds Sterling, your task is to convert both the Open and
Close column prices.

Read the DataFrames sp500 & exchange from the files 'sp500.csv' & 'exchange.csv' respectively.
Use parse_dates=True and index_col='Date'.
Extract the columns 'Open' & 'Close' from the DataFrame sp500 as a new DataFrame dollars and 
print the first 5 rows.
Construct a new DataFrame pounds by converting US dollars to British pounds. You'll use
the .multiply() method of dollars with exchange['GBP/USD'] and axis='rows'
Print the first 5 rows of the new DataFrame pounds.
'''
# Import pandas
import pandas as pd

# Read the S&P 500 data (saved locally as 'sp500_2.csv') into a DataFrame: sp500
sp500 = pd.read_csv('sp500_2.csv', parse_dates=True, index_col='Date')

# Read 'exchange.csv' into a DataFrame: exchange
exchange = pd.read_csv('exchange.csv', parse_dates=True, index_col='Date')

# Subset 'Open' & 'Close' columns from sp500: dollars
dollars = sp500[['Open', 'Close']]

# Print the head of dollars
print(dollars.head())

# Convert dollars to pounds: pounds
pounds = dollars.multiply(exchange['GBP/USD'],axis='rows')

# Print the head of pounds
print(pounds.head())

                   Open        Close
Date                                
2015-02-01  2058.899902  2058.199951
2015-05-01  2054.439941  2020.579956
2015-06-01  2022.150024  2002.609985
2015-07-01  2005.550049  2025.900024
2015-08-01  2030.609985  2062.139893
                   Open        Close
Date                                
2015-01-04  1394.926501  1389.569819
2015-01-05  1379.069267  1392.883980
2015-01-06  1388.497197  1390.531957
2015-01-07          NaN          NaN
2015-01-09  1284.262247  1247.600522
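
The NaN row in the output above comes from index alignment: a date with no matching exchange rate cannot be converted. The sketch below shows the same effect on invented prices and rates. It passes `axis='index'`, the documented name for the row axis, where the exercise uses `'rows'`.

```python
import pandas as pd

# Hypothetical prices and exchange rates sharing a date index
dates = pd.to_datetime(['2015-01-02', '2015-01-05', '2015-01-06'])
dollars = pd.DataFrame({'Open': [100.0, 200.0, 300.0],
                        'Close': [110.0, 210.0, 310.0]}, index=dates)

# The rate is deliberately missing for the last date
rate = pd.Series([0.5, 0.25], index=dates[:2], name='GBP/USD')

# axis='index' aligns the Series with the row labels, so each row is scaled
# by that day's rate; dates missing from `rate` become NaN
pounds = dollars.multiply(rate, axis='index')
print(pounds)
```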


## Appending and concatenating series

.append() - Series and DataFrame method (deprecated in pandas 1.4, removed in 2.0)
s1.append(s2)
stacks the rows of s2 below s1

pd.concat([s1, s2, s3]) - can stack row-wise or column-wise

Equivalence of concat() and append()
result1 = pd.concat([s1,s2,s3])
result2 = s1.append(s2).append(s3)
result1 == result2
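
`Series.append` and `DataFrame.append` were deprecated in pandas 1.4 and removed in pandas 2.0, so the chained form above only runs on older versions. A minimal sketch of the `pd.concat` form, which works on both old and new versions, using three short made-up Series:

```python
import pandas as pd

s1 = pd.Series(['CT', 'ME'])
s2 = pd.Series(['DE', 'FL'])
s3 = pd.Series(['IL', 'IN'])

# One pd.concat call replaces the chained s1.append(s2).append(s3)
result = pd.concat([s1, s2, s3])
print(result.index.tolist())      # original labels are kept: [0, 1, 0, 1, 0, 1]

# ignore_index=True builds a fresh 0..n-1 index instead
clean = pd.concat([s1, s2, s3], ignore_index=True)
print(clean.index.tolist())       # [0, 1, 2, 3, 4, 5]
```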

In [40]:
#series of US states
import pandas as pd
northeast = pd.Series(['CT','ME','NH'])
south=pd.Series(['DE','FL','GA'])
midwest = pd.Series(['IL','IN','MN'])
west = pd.Series(['AZ','CO','ID'])

In [41]:
#using append()
east=northeast.append(south)
print(east)

0    CT
1    ME
2    NH
0    DE
1    FL
2    GA
dtype: object


In [42]:
#append index
print(east.index)

Int64Index([0, 1, 2, 0, 1, 2], dtype='int64')


In [44]:
print(east.loc[2])

2    NH
2    GA
dtype: object


In [45]:
#using .reset_index()
new_east = northeast.append(south).reset_index(drop=True)
print(new_east)

0    CT
1    ME
2    NH
3    DE
4    FL
5    GA
dtype: object


In [46]:
#using concat()
east = pd.concat([northeast,south])
print(east.head(11))

0    CT
1    ME
2    NH
0    DE
1    FL
2    GA
dtype: object


In [47]:
#concat using ignore_index
new_east = pd.concat([northeast,south], ignore_index=True)
print(new_east.head(11))

0    CT
1    ME
2    NH
3    DE
4    FL
5    GA
dtype: object


### Appending and concatenating DataFrames


In [49]:
import pandas as pd
pop1 = pd.read_csv('pop1.csv', index_col=0)
pop2 = pd.read_csv('pop2.csv', index_col=0)
print(type(pop1), pop1.shape)

<class 'pandas.core.frame.DataFrame'> (4, 1)


In [50]:
#examining data
print(pop1)

           zsta
zip code       
66407       479
72732      4716
50579      2405
46241     30670


In [51]:
#appending population dataframes
pop1.append(pop2)

           zsta
zip code
66407       479
72732      4716
50579      2405
46241     30670
12776      2180
76092     26669
98360     12221
49464     27481


In [52]:
print(pop1.index.name, pop1.columns)

zip code Index(['zsta'], dtype='object')


In [58]:
#population and unemployment data
population=pd.read_csv('pop1.csv', index_col=0)
unemployment=pd.read_csv('unemployment1.csv', index_col=0)

In [59]:
print(population)

           zsta
zip code       
66407       479
72732      4716
50579      2405
46241     30670


In [60]:
print(unemployment)

          unemployment  participants
zip code                            
46241             0.06        243543
343543            0.09        646464
34343             0.17        646444
49463             0.10        333666


In [61]:
##appending population and unemployment

## note the repeated index label (zip code 46241) in the table below
population.append(unemployment)

          participants  unemployment     zsta
zip code
66407              NaN           NaN    479.0
72732              NaN           NaN   4716.0
50579              NaN           NaN   2405.0
46241              NaN           NaN  30670.0
46241         243543.0          0.06      NaN
343543        646464.0          0.09      NaN
34343         646444.0          0.17      NaN
49463         333666.0          0.10      NaN


In [62]:
#concatenating rows
pd.concat([population,unemployment], axis=0)

          participants  unemployment     zsta
zip code
66407              NaN           NaN    479.0
72732              NaN           NaN   4716.0
50579              NaN           NaN   2405.0
46241              NaN           NaN  30670.0
46241         243543.0          0.06      NaN
343543        646464.0          0.09      NaN
34343         646444.0          0.17      NaN
49463         333666.0          0.10      NaN


In [63]:
#concatenating columns

#now the data lines up: see zip code 46241
pd.concat([population,unemployment], axis=1)

             zsta  unemployment  participants
zip code
34343         NaN          0.17      646444.0
46241     30670.0          0.06      243543.0
49463         NaN          0.10      333666.0
50579      2405.0           NaN           NaN
66407       479.0           NaN           NaN
72732      4716.0           NaN           NaN
343543        NaN          0.09      646464.0
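
The difference between the two axes can be reproduced without the CSV files. This sketch rebuilds the population data and the unemployment rates from the tables printed above. The `participants` column is left out for brevity.

```python
import pandas as pd

# Rows copied from the population and unemployment tables above
population = pd.DataFrame(
    {'zsta': [479, 4716, 2405, 30670]},
    index=pd.Index([66407, 72732, 50579, 46241], name='zip code'))
unemployment = pd.DataFrame(
    {'unemployment': [0.06, 0.09, 0.17, 0.10]},
    index=pd.Index([46241, 343543, 34343, 49463], name='zip code'))

rows = pd.concat([population, unemployment], axis=0)   # 8 rows, 46241 appears twice
cols = pd.concat([population, unemployment], axis=1)   # 7 rows, one per distinct zip code

print(rows.shape, cols.shape)   # (8, 2) (7, 2)
print(cols.loc[46241])
```

With `axis=0` the frames are stacked and the shared zip code is duplicated. With `axis=1` the rows are aligned on the index, so 46241 gets both its population and its unemployment rate on a single row.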


### Concatenation, keys and MultiIndexes


In [65]:
#loading rain data
import pandas as pd
file1 = 'rain2013.csv'
rain2013 = pd.read_csv(file1, index_col='month', parse_dates=True)
file1 = 'rain2014.csv'
rain2014 = pd.read_csv(file1, index_col='month', parse_dates=True)
print(rain2013)
print(rain2014)

       precipitation
month               
jan             0.09
feb             0.06
mar             0.07
       precipitation
month               
jan           0.0967
feb           0.0633
mar           0.0766


In [66]:
#concatenating rows
pd.concat([rain2013,rain2014], axis=0)

       precipitation
month
jan           0.0900
feb           0.0600
mar           0.0700
jan           0.0967
feb           0.0633
mar           0.0766


In [67]:
#using multi index on rows
rain1314=pd.concat([rain2013,rain2014], keys=[2013,2014],axis=0)
print(rain1314)

            precipitation
     month               
2013 jan           0.0900
     feb           0.0600
     mar           0.0700
2014 jan           0.0967
     feb           0.0633
     mar           0.0766


In [68]:
#accessing the multi-index
print(rain1314.loc[2014])

       precipitation
month               
jan           0.0967
feb           0.0633
mar           0.0766


In [70]:
#concatenating columns
rain1314=pd.concat([rain2013,rain2014],axis='columns')
print(rain1314)

       precipitation  precipitation
month                              
jan             0.09         0.0967
feb             0.06         0.0633
mar             0.07         0.0766


In [71]:
#using a multiindex on columns
rain1314=pd.concat([rain2013,rain2014],keys=[2013,2014],axis='columns')
print(rain1314)

               2013          2014
      precipitation precipitation
month                            
jan            0.09        0.0967
feb            0.06        0.0633
mar            0.07        0.0766


In [72]:
rain1314[2013]

       precipitation
month
jan             0.09
feb             0.06
mar             0.07


In [73]:
#pd.concat() with dict
rain_dict = {2013: rain2013, 2014: rain2014}
rain1314 = pd.concat(rain_dict, axis='columns')
print(rain1314)

               2013          2014
      precipitation precipitation
month                            
jan            0.09        0.0967
feb            0.06        0.0633
mar            0.07        0.0766
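
A self-contained sketch of the `keys` argument along the row axis, using the precipitation values printed above so that no CSV files are needed. It also shows that a dict passed to `pd.concat` supplies its keys in the same way.

```python
import pandas as pd

# Values copied from the rain2013 and rain2014 tables above
months = pd.Index(['jan', 'feb', 'mar'], name='month')
rain2013 = pd.DataFrame({'precipitation': [0.09, 0.06, 0.07]}, index=months)
rain2014 = pd.DataFrame({'precipitation': [0.0967, 0.0633, 0.0766]}, index=months)

# keys= adds an outer index level that records which frame each row came from
rain1314 = pd.concat([rain2013, rain2014], keys=[2013, 2014])

print(rain1314.index.nlevels)                         # 2
print(rain1314.loc[(2014, 'feb'), 'precipitation'])   # 0.0633

# Passing a dict uses its keys as the outer level
same = pd.concat({2013: rain2013, 2014: rain2014})
print(list(same.index) == list(rain1314.index))
```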


### Outer and Inner Joins

In [75]:
import numpy as np
import pandas as pd
A=np.arange(8).reshape(2,4)+0.1
print(A)


[[0.1 1.1 2.1 3.1]
 [4.1 5.1 6.1 7.1]]


In [77]:
B=np.arange(6).reshape(2,3)+0.2
print(B)

[[0.2 1.2 2.2]
 [3.2 4.2 5.2]]


In [78]:
C=np.arange(12).reshape(3,4)+0.3
print(C)

[[ 0.3  1.3  2.3  3.3]
 [ 4.3  5.3  6.3  7.3]
 [ 8.3  9.3 10.3 11.3]]


In [79]:
### Stacking arrays horizontally
np.hstack([B,A])

array([[0.2, 1.2, 2.2, 0.1, 1.1, 2.1, 3.1],
       [3.2, 4.2, 5.2, 4.1, 5.1, 6.1, 7.1]])

In [80]:
np.concatenate([B,A], axis=1)

array([[0.2, 1.2, 2.2, 0.1, 1.1, 2.1, 3.1],
       [3.2, 4.2, 5.2, 4.1, 5.1, 6.1, 7.1]])

In [81]:
### stacking arrays vertically
np.vstack([A,C])

array([[ 0.1,  1.1,  2.1,  3.1],
       [ 4.1,  5.1,  6.1,  7.1],
       [ 0.3,  1.3,  2.3,  3.3],
       [ 4.3,  5.3,  6.3,  7.3],
       [ 8.3,  9.3, 10.3, 11.3]])

In [82]:
np.concatenate([A,C], axis=0)

array([[ 0.1,  1.1,  2.1,  3.1],
       [ 4.1,  5.1,  6.1,  7.1],
       [ 0.3,  1.3,  2.3,  3.3],
       [ 4.3,  5.3,  6.3,  7.3],
       [ 8.3,  9.3, 10.3, 11.3]])

In [83]:
#population and unemployment DataFrames
population=pd.read_csv('pop1.csv', index_col=0)
unemployment=pd.read_csv('unemployment1.csv', index_col=0)

In [84]:
print(population)

           zsta
zip code       
66407       479
72732      4716
50579      2405
46241     30670


In [85]:
#converting to arrays
population_array = np.array(population)
print(population_array)   #index info is lost

[[  479]
 [ 4716]
 [ 2405]
 [30670]]


In [86]:
unemployment_array = np.array(unemployment)
print(unemployment_array)

[[6.00000e-02 2.43543e+05]
 [9.00000e-02 6.46464e+05]
 [1.70000e-01 6.46444e+05]
 [1.00000e-01 3.33666e+05]]


In [87]:
#manipulating data as arrays
print(np.concatenate([population_array,unemployment_array], axis=1))

[[4.79000e+02 6.00000e-02 2.43543e+05]
 [4.71600e+03 9.00000e-02 6.46464e+05]
 [2.40500e+03 1.70000e-01 6.46444e+05]
 [3.06700e+04 1.00000e-01 3.33666e+05]]


### Joins
joining tables: combining rows of multiple tables
outer join
    union of index sets (all labels, no repetition)
    missing fields filled with NaN
inner join
    intersection of index sets (only common labels)
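
The two definitions above can be checked directly on Index objects. This sketch uses the zip codes from the population and unemployment tables printed earlier.

```python
import pandas as pd

# Zip codes copied from the population and unemployment tables above
pop_index = pd.Index([66407, 72732, 50579, 46241])
unemp_index = pd.Index([46241, 343543, 34343, 49463])

outer = pop_index.union(unemp_index)          # every label, no repetition
inner = pop_index.intersection(unemp_index)   # only labels present in both

print(len(outer))       # 7
print(list(inner))      # [46241]
```

These are the row labels that `pd.concat(..., axis=1)` produces with `join='outer'` and `join='inner'` respectively.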
   

In [88]:
#concatenation and inner join
pd.concat([population, unemployment], axis=1, join='inner')

           zsta  unemployment  participants
zip code
46241     30670          0.06        243543


In [89]:
#concatenation and outer join
pd.concat([population, unemployment], axis=1, join='outer')

             zsta  unemployment  participants
zip code
34343         NaN          0.17      646444.0
46241     30670.0          0.06      243543.0
49463         NaN          0.10      333666.0
50579      2405.0           NaN           NaN
66407       479.0           NaN           NaN
72732      4716.0           NaN           NaN
343543        NaN          0.09      646464.0


In [91]:
#inner join on the other axis: the frames share no columns, so only the index is left
pd.concat([population, unemployment], join='inner', axis=0)

66407
72732
50579
46241
46241
343543
34343
49463


### Exercises

In [1]:
'''
Appending pandas Series
In this exercise, you'll load sales data from the months January, February, and March 
into DataFrames. Then, you'll extract Series with the 'Units' column from each and append 
them together with method chaining using .append().
To check that the stacking worked, you'll print slices from these Series, and finally,
you'll add the result to figure out the total units sold in the first quarter.
INSTRUCTIONS
Read the files 'sales-jan-2015.csv', 'sales-feb-2015.csv' and 'sales-mar-2015.csv' into
the DataFrames jan, feb, and mar respectively.
Use parse_dates=True and index_col='Date'.
Extract the 'Units' column of jan, feb, and mar to create the Series jan_units, feb_units,
and mar_units respectively.
Construct the Series quarter1 by appending feb_units to jan_units and then appending
mar_units to the result. Use chained calls to the .append() method to do this.
Verify that quarter1 has the individual Series stacked vertically. To do this:
Print the slice containing rows from jan 27, 2015 to feb 2, 2015.
Print the slice containing rows from feb 26, 2015 to mar 7, 2015.
Compute and print the total number of units sold from the Series quarter1.
'''
# Import pandas
import pandas as pd

# Load 'sales-jan-2015.csv' into a DataFrame: jan
jan = pd.read_csv('sales-jan-2015.csv', parse_dates=True, index_col='Date')

# Load 'sales-feb-2015.csv' into a DataFrame: feb
feb = pd.read_csv('sales-feb-2015.csv', parse_dates=True, index_col='Date')

# Load 'sales-mar-2015.csv' into a DataFrame: mar
mar = pd.read_csv('sales-mar-2015.csv', parse_dates=True, index_col='Date')

jan.head(3)

                       Company   Product  Units
Date
2015-01-21 19:13:00  Streeplex  Hardware     11
2015-09-01 05:23:00  Streeplex   Service      8
2015-06-01 17:19:00    Initech  Hardware     17


In [2]:
# Extract the 'Units' column from jan: jan_units
jan_units = jan['Units']

# Extract the 'Units' column from feb: feb_units
feb_units = feb['Units']

# Extract the 'Units' column from mar: mar_units
mar_units = mar['Units']

# Append feb_units and then mar_units to jan_units: quarter1
quarter1 = jan_units.append(feb_units).append(mar_units)

quarter1.head(3)

Date
2015-01-21 19:13:00    11
2015-09-01 05:23:00     8
2015-06-01 17:19:00    17
Name: Units, dtype: int64

In [3]:
jan_units.head(3)

Date
2015-01-21 19:13:00    11
2015-09-01 05:23:00     8
2015-06-01 17:19:00    17
Name: Units, dtype: int64

In [4]:
# Print the first slice from quarter1
print(quarter1.loc['jan 27, 2015':'feb 2, 2015'])

# Print the second slice from quarter1
print(quarter1.loc['feb 26, 2015':'mar 7, 2015'])

# Compute & print total sales in quarter1
print(quarter1.sum())

Date
2015-02-01 09:51:00    16
2015-01-27 07:11:00    18
2015-02-02 08:33:00     3
2015-02-02 20:54:00     9
Name: Units, dtype: int64
Date
2015-03-01 18:00:00    19
2015-02-26 08:57:00     4
2015-03-02 14:14:00    13
2015-02-26 08:58:00     1
Name: Units, dtype: int64
642
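Series.append() and DataFrame.append() were deprecated in pandas 1.4 and removed in pandas 2.0, so the chained .append() call above only runs on older versions. A minimal sketch of the same stacking with pd.concat(), assuming small made-up Series in place of the sales files (the dates and values below are invented for illustration):

```python
import pandas as pd

# Hypothetical stand-ins for jan_units, feb_units and mar_units
jan_units = pd.Series([11, 8], name='Units',
                      index=pd.to_datetime(['2015-01-21', '2015-01-27']))
feb_units = pd.Series([3, 9], name='Units',
                      index=pd.to_datetime(['2015-02-02', '2015-02-26']))
mar_units = pd.Series([19, 13], name='Units',
                      index=pd.to_datetime(['2015-03-01', '2015-03-02']))

# Same result as jan_units.append(feb_units).append(mar_units) on old pandas
quarter1 = pd.concat([jan_units, feb_units, mar_units])

# Slice by date labels and total the units
print(quarter1.loc['2015-01-27':'2015-02-02'])
print(quarter1.sum())
```

The list passed to pd.concat() can hold any number of Series, so no chaining is needed.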


In [5]:
'''
Concatenating pandas Series along row axis
Having learned how to append Series, you'll now learn how to achieve the same result by concatenating Series instead. 
You'll continue to work with the sales data you've seen previously. This time, the DataFrames jan, feb, and mar have
been pre-loaded.
Your job is to use pd.concat() with a list of Series to achieve the same result that you would get by chaining 
calls to .append().
You may be wondering about the difference between pd.concat() and pandas' .append() method. One way to think 
of the difference is that .append() is a specific case of a concatenation, while pd.concat() gives you more
flexibility, as you'll see in later exercises.

Create an empty list called units. This has been done for you.
Use a for loop to iterate over [jan, feb, mar]:
In each iteration of the loop, append the 'Units' column of each DataFrame to units.
Concatenate the Series contained in the list units into a longer Series called quarter1 using pd.concat().
Specify the keyword argument axis='rows' to stack the Series vertically.
Verify that quarter1 has the individual Series stacked vertically by printing slices. This has been done for you,
so hit 'Submit Answer' to see the result!
'''
# Initialize empty list: units
units = []

# Build the list of Series
for month in [jan, feb, mar]:
    units.append(month['Units'])

# Concatenate the list: quarter2 (new name so quarter1 from the previous cell is kept)
quarter2 = pd.concat(units, axis='rows')

# Print slices from quarter2
print(quarter2.loc['jan 27, 2015':'feb 2, 2015'])
print(quarter2.loc['feb 26, 2015':'mar 7, 2015'])

Date
2015-02-01 09:51:00    16
2015-01-27 07:11:00    18
2015-02-02 08:33:00     3
2015-02-02 20:54:00     9
Name: Units, dtype: int64
Date
2015-03-01 18:00:00    19
2015-02-26 08:57:00     4
2015-03-02 14:14:00    13
2015-02-26 08:58:00     1
Name: Units, dtype: int64


In [7]:
names_1881=pd.read_csv('names1881.csv',header=None, names=['name','sex','number'],index_col=(0,1))
names_1981=pd.read_csv('names1981.csv',header=None, names=['name','sex','number'],index_col=(0,1))

names_1881.head(3)

Unnamed: 0_level_0,Unnamed: 1_level_0,number
name,sex,Unnamed: 2_level_1
Mary,F,6919
Anna,F,2698
Emma,F,2034


In [9]:
'''
Appending DataFrames with ignore_index
In this exercise, you'll use the Baby Names Dataset (from data.gov) again. This time, both DataFrames names_1981 and 
names_1881 are loaded without specifying an Index column (so the default Indexes for both are RangeIndexes).
You'll use the DataFrame .append() method to make a DataFrame combined_names. To distinguish rows from the original 
two DataFrames, you'll add a 'year' column to each with the year (1881 or 1981 in this case). In addition, 
you'll specify ignore_index=True so that the index values are not used along the concatenation axis. 
The resulting axis will instead be labeled 0, 1, ..., n-1, which is useful if you are concatenating objects 
where the concatenation axis does not have meaningful indexing information.

Create a 'year' column in the DataFrames names_1881 and names_1981, with values of 1881 and 1981 respectively.
Recall that assigning a scalar value to a DataFrame column broadcasts that value throughout.
Create a new DataFrame called combined_names by appending the rows of names_1981 underneath the rows of names_1881. 
Specify the keyword argument ignore_index=True to make a new RangeIndex of unique integers for each row.
Print the shapes of all three DataFrames. This has been done for you.
Extract all rows from combined_names that have the name 'Morgan'. To do this, use the .loc[] accessor with an appropriate 
filter. The relevant column of combined_names here is 'name'.
'''
# Add 'year' column to names_1881 and names_1981
names_1881['year'] = 1881
names_1981['year'] = 1981

# Append names_1981 after names_1881 with ignore_index=True: combined_names
combined_names = names_1881.append(names_1981, ignore_index=True)

# Print shapes of names_1981, names_1881, and combined_names
print(names_1981.shape)
print(names_1881.shape)
print(combined_names.shape)


(19455, 2)
(1935, 2)
(21390, 2)
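DataFrame.append() is no longer available in pandas 2.0 and later; pd.concat() accepts the same ignore_index=True argument. A small sketch, assuming two tiny hand-made tables with default RangeIndexes in place of the real name files:

```python
import pandas as pd

# Hypothetical miniature versions of the two name tables
names_1881 = pd.DataFrame({'name': ['Mary', 'Morgan'],
                           'sex': ['F', 'M'],
                           'number': [6919, 23]})
names_1981 = pd.DataFrame({'name': ['Morgan', 'Morgan'],
                           'sex': ['F', 'M'],
                           'number': [1769, 766]})

# A scalar assignment is broadcast to every row
names_1881['year'] = 1881
names_1981['year'] = 1981

# ignore_index=True drops both old indexes and builds a fresh 0..n-1 index
combined_names = pd.concat([names_1881, names_1981], ignore_index=True)

# Boolean filter on the 'name' column
print(combined_names.loc[combined_names['name'] == 'Morgan'])
```

Without ignore_index=True the result would keep the labels 0, 1, 0, 1, so index values would repeat.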


In [12]:
#append with no change in index
combined_names2 = names_1881.append(names_1981)
combined_names2.head(3)

Unnamed: 0_level_0,Unnamed: 1_level_0,number,year
name,sex,Unnamed: 2_level_1,Unnamed: 3_level_1
Mary,F,6919,1881
Anna,F,2698,1881
Emma,F,2034,1881


In [21]:
#move 'name' and 'sex' back into columns, then use 'year' as the index
combined_names3 = combined_names2.reset_index().set_index('year')
combined_names3.head(3)

Unnamed: 0_level_0,name,sex,number
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1881,Mary,F,6919
1881,Anna,F,2698
1881,Emma,F,2034


In [23]:
print(combined_names3.loc[combined_names3['name'] == 'Morgan'])

        name sex  number
year                    
1881  Morgan   M      23
1981  Morgan   F    1769
1981  Morgan   M     766


In [None]:
'''
Concatenating pandas DataFrames along column axis
The function pd.concat() can concatenate DataFrames horizontally as well as vertically (vertical is the default).
To make the DataFrames stack horizontally, you have to specify the keyword argument axis=1 or axis='columns'.
In this exercise, you'll use weather data with maximum and mean daily temperatures sampled at different rates
(quarterly versus monthly). You'll concatenate the rows of both and see that, where rows are missing in the coarser
DataFrame, null values are inserted in the concatenated DataFrame. This corresponds to an outer join 
(which you will explore in more detail in later exercises).
The files 'quarterly_max_temp.csv' and 'monthly_mean_temp.csv' have been pre-loaded into the DataFrames 
weather_max and weather_mean respectively, and pandas has been imported as pd.

Create a new DataFrame called weather by concatenating the DataFrames weather_max and weather_mean horizontally.
Pass the DataFrames to pd.concat() as a list and specify the keyword argument axis=1 to stack them horizontally.
Print the new DataFrame weather.
'''
# Concatenate weather_max and weather_mean horizontally: weather
weather = pd.concat([weather_max, weather_mean], axis=1)

# Print weather
print(weather)

In [26]:
'''
Reading multiple files to build a DataFrame
It is often convenient to build a large DataFrame by parsing many files as DataFrames and concatenating 
them all at once. You'll do this here with three files, but, in principle, this approach can be used to 
combine data from dozens or hundreds of files.
Here, you'll work with DataFrames compiled from The Guardian's Olympic medal dataset.
pandas has been imported as pd and two lists have been pre-loaded: An empty list called medals, and medal_types,
which contains the strings 'bronze', 'silver', and 'gold'.

Iterate over medal_types in the for loop.
Inside the for loop:
Create file_name using string interpolation with the loop variable medal. This has been done for you. 
The expression "%s_top5.csv" % medal evaluates as a string with the value of medal replacing %s in the format string.
Create the list of column names called columns. This has been done for you.
Read file_name into a DataFrame called medal_df. Specify the keyword arguments header=0, index_col='Country',
and names=columns to get the correct row and column Indexes.
Append medal_df to medals using the list .append() method.
Concatenate the list of DataFrames medals horizontally (using axis='columns') to create a single DataFrame
called medals. Print it in its entirety.
'''
medals=[]
medal_types=['bronze','silver','gold']
for medal in medal_types:

    # Create the file name: file_name
    file_name = "%s_top5.csv" % medal
    
    # Create list of column names: columns
    columns = ['Country', medal]
    
    # Read file_name into a DataFrame: medal_df
    medal_df = pd.read_csv(file_name, header=0, index_col='Country', names=columns)

    # Append medal_df to medals
    medals.append(medal_df)


print(medals)

[                bronze
Country               
United States     1052
Soviet Union       584
United Kingdom     505
France             475
Germany            454,                 silver
Country               
United States     1195
Soviet Union       627
United Kingdom     591
France             461
Italy              394,                 gold
Country             
United States   2088
Soviet Union     838
United Kingdom   498
Italy            460
Germany          407]


In [27]:
# Concatenate medals horizontally: medals
medals = pd.concat(medals, axis='columns')

# Print medals
print(medals)

                bronze  silver    gold
France           475.0   461.0     NaN
Germany          454.0     NaN   407.0
Italy              NaN   394.0   460.0
Soviet Union     584.0   627.0   838.0
United Kingdom   505.0   591.0   498.0
United States   1052.0  1195.0  2088.0
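The medal counts print as floats (1052.0) because the horizontal concatenation is an outer join on the row index: countries missing from one file get NaN, and NaN forces the column to a float dtype. A minimal sketch with two made-up frames:

```python
import pandas as pd

# Hypothetical two-row tables indexed by country
bronze = pd.DataFrame({'bronze': [475, 454]}, index=['France', 'Germany'])
gold = pd.DataFrame({'gold': [407, 460]}, index=['Germany', 'Italy'])

# Outer join on the index (the default): the union of the row labels is kept
medals = pd.concat([bronze, gold], axis='columns')

print(medals)
print(medals.dtypes)
```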




In [28]:
'''
Concatenating vertically to get MultiIndexed rows
When stacking a sequence of DataFrames vertically, it is sometimes desirable to construct a MultiIndex to indicate
the DataFrame from which each row originated. This can be done by specifying the keys parameter in the call to
pd.concat(), which generates a hierarchical index with the labels from keys as the outermost index label. So
you don't have to rename the columns of each DataFrame as you load it. Instead, only the Index column needs
to be specified.
Here, you'll continue working with DataFrames compiled from The Guardian's Olympic medal dataset. Once again, 
pandas has been imported as pd and two lists have been pre-loaded: An empty list called medals, and medal_types,
which contains the strings 'bronze', 'silver', and 'gold'.

Within the for loop:
Read file_name into a DataFrame called medal_df. Specify the index to be 'Country'.
Append medal_df to medals.
Concatenate the list of DataFrames medals into a single DataFrame called medals. Be sure to use the keyword
argument keys=['bronze', 'silver', 'gold'] to create a vertically stacked DataFrame with a MultiIndex.
Print the new DataFrame medals. This has been done for you, so hit 'Submit Answer' to see the result!
'''
medals=[]
medal_types=['bronze','silver','gold']

for medal in medal_types:

    file_name = "%s_top5.csv" % medal

    # Read file_name into a DataFrame: medal_df
    medal_df = pd.read_csv(file_name, index_col='Country')
    
    # Append medal_df to medals
    medals.append(medal_df)

# Concatenate medals: medals
medals = pd.concat(medals, keys=['bronze', 'silver', 'gold'])

# Print medals
print(medals)

                       Total
       Country              
bronze United States    1052
       Soviet Union      584
       United Kingdom    505
       France            475
       Germany           454
silver United States    1195
       Soviet Union      627
       United Kingdom    591
       France            461
       Italy             394
gold   United States    2088
       Soviet Union      838
       United Kingdom    498
       Italy             460
       Germany           407
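How keys builds the outer index level can be checked on a tiny example. A sketch assuming two hand-made frames that share the column name 'Total':

```python
import pandas as pd

# Hypothetical frames; both use 'Country' as the index name
countries = pd.Index(['United States', 'Soviet Union'], name='Country')
bronze = pd.DataFrame({'Total': [1052, 584]}, index=countries)
gold = pd.DataFrame({'Total': [2088, 838]}, index=countries)

# keys labels each block of rows; the result has a two-level row index
medals = pd.concat([bronze, gold], keys=['bronze', 'gold'])

print(medals.index.nlevels)
print(medals.loc[('gold', 'Soviet Union'), 'Total'])
```

The outer level has no name by default; the inner level keeps the name 'Country'.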


In [29]:
'''
Slicing MultiIndexed DataFrames
This exercise picks up where the last ended (again using The Guardian's Olympic medal dataset).
You are provided with the MultiIndexed DataFrame as produced at the end of the preceding exercise. 
Your task is to sort the DataFrame and to use the pd.IndexSlice to extract specific slices. 
The course Manipulating DataFrames with pandas covers how to deal with MultiIndexed DataFrames
if you need a refresher.
pandas has been imported for you as pd and the DataFrame medals is already in your namespace.

Create a new DataFrame medals_sorted with the entries of medals sorted. Use .sort_index(level=0) to ensure 
the Index is sorted suitably.
Print the number of bronze medals won by Germany and all of the silver medal data. This has been done for you.
Create an alias for pd.IndexSlice called idx. A slicer pd.IndexSlice is required when slicing on the inner level
of a MultiIndex.
Slice all the data on medals won by the United Kingdom. To do this, use the .loc[] accessor 
with idx[:,'United Kingdom'], :.
'''
# Sort the entries of medals
medals_sorted = medals.sort_index(level=0)

# Print the number of Bronze medals won by Germany
print(medals_sorted.loc[('bronze','Germany')])


Total    454
Name: (bronze, Germany), dtype: int64


In [30]:
# Print data about silver medals
print(medals_sorted.loc['silver'])


                Total
Country              
France            461
Italy             394
Soviet Union      627
United Kingdom    591
United States    1195


In [31]:
# Create alias for pd.IndexSlice: idx
idx = pd.IndexSlice

# Print all the data on medals won by the United Kingdom
print(medals_sorted.loc[idx[:,'United Kingdom'], :])

                       Total
       Country              
bronze United Kingdom    505
gold   United Kingdom    498
silver United Kingdom    591
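The slicing pattern can be tried without the CSV files. A sketch assuming a small hand-built MultiIndexed frame:

```python
import pandas as pd

# Hypothetical two-level index: medal type (outer), country (inner)
index = pd.MultiIndex.from_tuples(
    [('bronze', 'Germany'), ('bronze', 'United Kingdom'),
     ('gold', 'Germany'), ('gold', 'United Kingdom')],
    names=[None, 'Country'])
medals_sorted = pd.DataFrame({'Total': [454, 505, 407, 498]},
                             index=index).sort_index()

idx = pd.IndexSlice

# ':' selects every outer label; the second entry filters the inner level
uk = medals_sorted.loc[idx[:, 'United Kingdom'], :]
print(uk)
```

The trailing ', :' selects all columns; it keeps the row slicer from being read as a (row, column) pair.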


In [None]:
'''
Concatenating horizontally to get MultiIndexed columns
It is also possible to construct a DataFrame with hierarchically indexed columns. For this exercise,
you'll start with pandas imported and a list of three DataFrames called dataframes. All three DataFrames contain
'Company', 'Product', and 'Units' columns with a 'Date' column as the index pertaining to sales transactions during 
the month of February, 2015. The first DataFrame describes Hardware transactions, the second describes Software 
transactions, and the third, Service transactions.
Your task is to concatenate the DataFrames horizontally and to create a MultiIndex on the columns.
From there, you can summarize the resulting DataFrame and slice some information from it.
INSTRUCTIONS
Construct a new DataFrame february with MultiIndexed columns by concatenating the list dataframes.
Use axis=1 to stack the DataFrames horizontally and the keyword argument keys=['Hardware', 'Software', 'Service'] 
to construct a hierarchical Index from each DataFrame.
Print summary information from the new DataFrame february using the .info() method. This has been done for you.
Create an alias called idx for pd.IndexSlice.
Extract a slice called slice_2_8 from february (using .loc[] & idx) that comprises rows between Feb. 2, 
2015 to Feb. 8, 2015 from columns under 'Company'.
Print the slice_2_8. This has been done for you, so hit 'Submit Answer' to see the sliced data!
'''
# Concatenate dataframes: february
february = pd.concat(dataframes, axis=1, keys=['Hardware', 'Software', 'Service'])

# Print february.info()
print(february.info())

# Assign pd.IndexSlice: idx
idx = pd.IndexSlice

# Create the slice: slice_2_8
slice_2_8 = february.loc['2015-2-2':'2015-2-8', idx[:, 'Company']]

# Print slice_2_8
print(slice_2_8)

In [None]:
'''
You're now going to revisit the sales data you worked with earlier in the chapter. Three DataFrames jan, 
feb, and mar have been pre-loaded for you. Your task is to aggregate the sum of all sales over the 'Company' 
column into a single DataFrame. You'll do this by constructing a dictionary of these DataFrames and then 
concatenating them.

Create a list called month_list consisting of the tuples ('january', jan), ('february', feb), and ('march', mar).
Create an empty dictionary called month_dict.
Inside the for loop:
Group month_data by 'Company' and use .sum() to aggregate.
Construct a new DataFrame called sales by concatenating the DataFrames stored in month_dict.
Create an alias for pd.IndexSlice and print all sales by 'Mediacore'. This has been done for you, so hit 
'Submit Answer' to see the result!
'''
# Make the list of tuples: month_list
month_list = [('january', jan), ('february', feb), ('march', mar)]

# Create an empty dictionary: month_dict
month_dict = {}

for month_name, month_data in month_list:

    # Group month_data: month_dict[month_name]
    month_dict[month_name] = month_data.groupby('Company').sum()

# Concatenate data in month_dict: sales
sales = pd.concat(month_dict)

# Print sales
print(sales)

# Print all sales by Mediacore
idx = pd.IndexSlice
print(sales.loc[idx[:, 'Mediacore'], :])
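When pd.concat() receives a dictionary, the dictionary keys play the role of the keys argument. A runnable sketch, assuming two small invented monthly tables instead of the pre-loaded jan, feb and mar:

```python
import pandas as pd

# Hypothetical monthly sales tables
jan = pd.DataFrame({'Company': ['Hooli', 'Mediacore', 'Hooli'],
                    'Units': [3, 9, 10]})
feb = pd.DataFrame({'Company': ['Mediacore', 'Initech'],
                    'Units': [16, 13]})

# One aggregated DataFrame per month, stored under the month name
month_dict = {}
for month_name, month_data in [('january', jan), ('february', feb)]:
    month_dict[month_name] = month_data.groupby('Company').sum()

# Dictionary keys become the outer index level; sorting is added here
# as a precaution so that MultiIndex slicing behaves predictably
sales = pd.concat(month_dict).sort_index()

idx = pd.IndexSlice
mediacore = sales.loc[idx[:, 'Mediacore'], :]
print(mediacore)
```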

In [34]:
'''
Concatenating DataFrames with inner join
Here, you'll continue working with DataFrames compiled from The Guardian's Olympic medal dataset.
The DataFrames bronze, silver, and gold have been pre-loaded for you.
Your task is to compute an inner join.
INSTRUCTIONS
Construct a list of DataFrames called medal_list with entries bronze, silver, and gold.
Concatenate medal_list horizontally with an inner join to create medals.
Use the keyword argument keys=['bronze', 'silver', 'gold'] to yield suitable hierarchical indexing.
Use axis=1 to get horizontal concatenation.
Use join='inner' to keep only rows that share common index labels.
Print the new DataFrame medals.
'''
gold = pd.read_csv('gold.csv')
silver = pd.read_csv('silver.csv')
bronze = pd.read_csv('bronze.csv')

# Create the list of DataFrames: medal_list
medal_list = [bronze, silver, gold]

# Concatenate medal_list horizontally using an inner join: medals
medals = pd.concat(medal_list, keys=['bronze', 'silver', 'gold'], axis=1, join='inner')

# Print medals
medals.head(5)

Unnamed: 0_level_0,bronze,bronze,bronze,silver,silver,silver,gold,gold,gold
Unnamed: 0_level_1,NOC,Country,Total,NOC,Country,Total,NOC,Country,Total
0,USA,United States,1052.0,USA,United States,1195.0,USA,United States,2088.0
1,URS,Soviet Union,584.0,URS,Soviet Union,627.0,URS,Soviet Union,838.0
2,GBR,United Kingdom,505.0,GBR,United Kingdom,591.0,GBR,United Kingdom,498.0
3,FRA,France,475.0,FRA,France,461.0,FRA,France,378.0
4,GER,Germany,454.0,GER,Germany,350.0,GER,Germany,407.0
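In the cell above the CSV files are read without index_col, so the inner join appears to match rows by their position in the default RangeIndex (0, 1, 2, ...) rather than by country. A sketch of the same call on made-up frames indexed by country, where join='inner' keeps only labels present in every frame:

```python
import pandas as pd

# Hypothetical tables indexed by country
bronze = pd.DataFrame({'Total': [475, 454]}, index=['France', 'Germany'])
gold = pd.DataFrame({'Total': [407, 460]}, index=['Germany', 'Italy'])

# Only 'Germany' is present in both indexes
medals = pd.concat([bronze, gold], keys=['bronze', 'gold'],
                   axis=1, join='inner')
print(medals)
```

Passing index_col='Country' to pd.read_csv() would be one way to get this label-based behavior with the real files.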


In [37]:
'''
Resampling & concatenating DataFrames with inner join
In this exercise, you'll compare the historical 10-year GDP (Gross Domestic Product) growth in the US and
in China. The data for the US starts in 1947 and is recorded quarterly; by contrast, the data for China starts 
in 1966 and is recorded annually.
You'll need to use a combination of resampling and an inner join to align the index labels. You'll need an 
appropriate offset alias for resampling, and the method .resample() must be chained with some kind of aggregation
method (.pct_change() and .last() in this case).
pandas has been imported as pd, and the DataFrames china and us have been pre-loaded, with the output of china.head() 
and us.head() printed in the IPython Shell.
INSTRUCTIONS
Make a new DataFrame china_annual by resampling the DataFrame china with .resample('A') (i.e., with annual frequency) 
and chaining two method calls:
Chain .pct_change(10) as an aggregation method to compute the percentage change with an offset of ten years.
Chain .dropna() to eliminate rows containing null values.
Make a new DataFrame us_annual by resampling the DataFrame us exactly as you resampled china.
Concatenate china_annual and us_annual to construct a DataFrame called gdp. Use join='inner' to perform an inner
join and use axis=1 to concatenate horizontally.
Print the result of resampling gdp every decade (i.e., using .resample('10A')) and aggregating with the method .last(). 
This has been done for you, so hit 'Submit Answer' to see the result!
'''
china=pd.read_csv('gdp_china.csv',index_col='Year', parse_dates = True)
us=pd.read_csv('gdp_usa.csv',index_col='DATE', parse_dates = True)
# Resample and tidy china: china_annual
china_annual = china.resample('A').last().pct_change(10).dropna()

# Resample and tidy us: us_annual
us_annual = us.resample('A').last().pct_change(10).dropna()

# Concatenate china_annual and us_annual: gdp
gdp = pd.concat([china_annual, us_annual], join='inner', axis=1)

# Resample gdp and print
print(gdp.resample('10A').last())

                 GDP     VALUE
Year                          
1970-12-31  0.546128  1.017187
1980-12-31  1.072537  1.742556
1990-12-31  0.892820  1.012126
2000-12-31  2.357522  0.738632
2010-12-31  4.011081  0.454332
2020-12-31  3.789936  0.361780


In [38]:
china_annual.head(3)

Unnamed: 0_level_0,GDP
Year,Unnamed: 1_level_1
1970-12-31,0.546128
1971-12-31,0.98886
1972-12-31,1.402472


In [39]:
china.head(3)

Unnamed: 0_level_0,GDP
Year,Unnamed: 1_level_1
1960-01-01,59.184116
1961-01-01,49.55705
1962-01-01,46.685178


In [40]:
us_annual.head(3)

Unnamed: 0_level_0,VALUE
DATE,Unnamed: 1_level_1
1957-12-31,0.827507
1958-12-31,0.782686
1959-12-31,0.953137
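The resample, pct_change and inner-join steps can be followed on synthetic numbers. This sketch makes several assumptions: the data are invented, the offset is one year instead of ten, and the year-start alias 'YS' is used because the year-end alias 'A' in the notes is flagged as deprecated in recent pandas releases.

```python
import pandas as pd

# Hypothetical quarterly series covering 2000-2003 and annual series 2001-2004
quarterly = pd.DataFrame(
    {'VALUE': range(1, 17)},
    index=pd.date_range('2000-01-01', periods=16, freq='QS'))
annual = pd.DataFrame(
    {'GDP': [10.0, 20.0, 30.0, 60.0]},
    index=pd.date_range('2001-01-01', periods=4, freq='YS'))

# Last quarter of each year, then year-over-year change; the first row is NaN
quarterly_annual = quarterly.resample('YS').last().pct_change(1).dropna()
annual_change = annual.pct_change(1).dropna()

# The inner join keeps only the years present in both results
gdp = pd.concat([annual_change, quarterly_annual], join='inner', axis=1)
print(gdp)
```

Only 2002 and 2003 survive, because each pct_change drops its first year and the two series overlap only on 2001-2003.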


## Merging DataFrames


In [2]:
import pandas as pd
population = pd.read_csv('pa_zipcode_population.csv')
population.head(3)

Unnamed: 0,zipcode,census
0,16855,282
1,15681,5678
2,18657,11222


In [3]:
cities = pd.read_csv('pa_zipcode_city.csv')
cities.head(3)

Unnamed: 0,zipcode,city,state
0,16855,mank,PA
1,15681,dff,PA
2,18657,ggg,PA


In [4]:
#merging
pd.merge(population,cities)

Unnamed: 0,zipcode,census,city,state
0,16855,282,mank,PA
1,15681,5678,dff,PA
2,18657,11222,ggg,PA
3,17307,5443,sgg,PA
4,15535,227,edc,PA


In [5]:
#medal dataframes
bronze = pd.read_csv('bronze.csv')
gold = pd.read_csv('gold.csv')
bronze.head(3)

Unnamed: 0,NOC,Country,Total
0,USA,United States,1052.0
1,URS,Soviet Union,584.0
2,GBR,United Kingdom,505.0


In [6]:
#merging all columns
pd.merge(bronze,gold)

Unnamed: 0,NOC,Country,Total
0,ESP,Spain,92.0
1,IRL,Ireland,8.0
2,SYR,Syria,1.0
3,MOZ,Mozambique,1.0
4,SUR,Suriname,1.0
5,PAR,Paraguay,
6,SCG,Serbia,
7,NAM,Namibia,
8,SIN,Singapore,
9,SRI,Sri Lanka,
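With no on argument, pd.merge() joins on every column name the two frames share (here NOC, Country and Total) and performs an inner join, so only rows whose totals are identical in both tables remain. Null keys are matched to each other, which would explain the rows with an empty Total above. A sketch on invented rows:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: USA totals differ, Spain totals match, Paraguay is null in both
bronze = pd.DataFrame({'NOC': ['USA', 'ESP', 'PAR'],
                       'Country': ['United States', 'Spain', 'Paraguay'],
                       'Total': [1052.0, 92.0, np.nan]})
gold = pd.DataFrame({'NOC': ['USA', 'ESP', 'PAR'],
                     'Country': ['United States', 'Spain', 'Paraguay'],
                     'Total': [2088.0, 92.0, np.nan]})

# All three shared columns act as join keys
merged = pd.merge(bronze, gold)
print(merged)
```

USA drops out because 1052.0 and 2088.0 do not match on the Total key.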


In [9]:
#merging on
merge_on = pd.merge(bronze, gold, on='NOC')
merge_on.head(3)

Unnamed: 0,NOC,Country_x,Total_x,Country_y,Total_y
0,USA,United States,1052.0,United States,2088.0
1,URS,Soviet Union,584.0,Soviet Union,838.0
2,GBR,United Kingdom,505.0,United Kingdom,498.0


In [10]:
#merging on multiple columns
merge_multiple = pd.merge(bronze,gold,on=['NOC','Country'])
merge_multiple.head(3)

Unnamed: 0,NOC,Country,Total_x,Total_y
0,USA,United States,1052.0,2088.0
1,URS,Soviet Union,584.0,838.0
2,GBR,United Kingdom,505.0,498.0


In [11]:
#using suffixes
merge_suf = pd.merge(bronze,gold, on=['NOC','Country'], suffixes=['_bronze', '_gold'])
merge_suf.head(3)

Unnamed: 0,NOC,Country,Total_bronze,Total_gold
0,USA,United States,1052.0,2088.0
1,URS,Soviet Union,584.0,838.0
2,GBR,United Kingdom,505.0,498.0
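The effect of on and suffixes on the resulting column names, sketched on two-row copies of the medal tables (values taken from the outputs above):

```python
import pandas as pd

bronze = pd.DataFrame({'NOC': ['USA', 'URS'],
                       'Country': ['United States', 'Soviet Union'],
                       'Total': [1052.0, 584.0]})
gold = pd.DataFrame({'NOC': ['USA', 'URS'],
                     'Country': ['United States', 'Soviet Union'],
                     'Total': [2088.0, 838.0]})

# One key: every other shared column is duplicated with _x / _y
by_noc = pd.merge(bronze, gold, on='NOC')

# Two keys plus explicit suffixes: only 'Total' is duplicated
by_both = pd.merge(bronze, gold, on=['NOC', 'Country'],
                   suffixes=('_bronze', '_gold'))

print(list(by_noc.columns))
print(list(by_both.columns))
```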


In [12]:
#counties dataframe
counties = pd.read_csv('pa_counties.csv')
counties

Unnamed: 0,CITY NAME,COUNTY NAME
0,SALTSBURG,INDIANA
1,MINERAL SPRINGS,CLEARFIELD
2,BIGLERVILLE,ADAMS
3,HANNASTOWN,WESTMORELAND
4,TUNKHANNOCK,WYOMING


In [13]:
cities = pd.read_csv('cities.csv')
cities

Unnamed: 0,zipcode,city,state
0,13454,SALTSBURG,PA
1,13445,GREAT BEND,PA
2,33244,PITTSBURG,PA
3,44322,LEMASTERS,PA
4,44333,TUNKHANNOCK,PA


In [14]:
#merge specifying columns
pd.merge(counties,cities,left_on='CITY NAME', right_on='city')

Unnamed: 0,CITY NAME,COUNTY NAME,zipcode,city,state
0,SALTSBURG,INDIANA,13454,SALTSBURG,PA
1,TUNKHANNOCK,WYOMING,44333,TUNKHANNOCK,PA


In [15]:
#switching left/right dataframes
pd.merge(cities,counties, left_on='city', right_on='CITY NAME')

Unnamed: 0,zipcode,city,state,CITY NAME,COUNTY NAME
0,13454,SALTSBURG,PA,SALTSBURG,INDIANA
1,44333,TUNKHANNOCK,PA,TUNKHANNOCK,WYOMING
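With left_on and right_on both key columns are kept in the result, as the outputs above show. A sketch on three-row versions of the tables, with an optional clean-up step that is not part of the original notes:

```python
import pandas as pd

counties = pd.DataFrame({'CITY NAME': ['SALTSBURG', 'BIGLERVILLE', 'TUNKHANNOCK'],
                         'COUNTY NAME': ['INDIANA', 'ADAMS', 'WYOMING']})
cities = pd.DataFrame({'zipcode': [13454, 13445, 44333],
                       'city': ['SALTSBURG', 'GREAT BEND', 'TUNKHANNOCK'],
                       'state': ['PA', 'PA', 'PA']})

# The key column has a different name in each table
merged = pd.merge(counties, cities, left_on='CITY NAME', right_on='city')

# Optional: drop the redundant copy of the key
tidy = merged.drop(columns='city')
print(tidy)
```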


### Joining DataFrames

Use the bronze and gold files.

In [16]:
bronze.head(3)

Unnamed: 0,NOC,Country,Total
0,USA,United States,1052.0
1,URS,Soviet Union,584.0
2,GBR,United Kingdom,505.0


In [17]:
inner_merge = pd.merge(bronze, gold, on=['NOC','Country'],
        suffixes = ['_bronze', '_gold'], how='inner')
inner_merge.head(3)

Unnamed: 0,NOC,Country,Total_bronze,Total_gold
0,USA,United States,1052.0,2088.0
1,URS,Soviet Union,584.0,838.0
2,GBR,United Kingdom,505.0,498.0


### Merging with left join
Keeps all rows of the left DF in the merged DF.
For rows in the left DF with matches in the right DF:
    non-joining columns of the right DF are appended to the left DF
For rows in the left DF with no matches in the right DF:
    non-joining columns are filled with nulls

In [18]:
#merging with left join
left_join_merge = pd.merge(bronze, gold, on=['NOC', 'Country'],
                          suffixes=['_bronze', '_gold'], how='left')
left_join_merge.head(5)

Unnamed: 0,NOC,Country,Total_bronze,Total_gold
0,USA,United States,1052.0,2088.0
1,URS,Soviet Union,584.0,838.0
2,GBR,United Kingdom,505.0,498.0
3,FRA,France,475.0,378.0
4,GER,Germany,454.0,407.0


In [19]:
#merging with right join
right_join_merge = pd.merge(bronze, gold, on=['NOC', 'Country'],
                          suffixes=['_bronze', '_gold'], how='right')
right_join_merge.head(5)

Unnamed: 0,NOC,Country,Total_bronze,Total_gold
0,USA,United States,1052.0,2088.0
1,URS,Soviet Union,584.0,838.0
2,GBR,United Kingdom,505.0,498.0
3,FRA,France,475.0,378.0
4,GER,Germany,454.0,407.0


In [20]:
#merging with outer join
outer_join_merge = pd.merge(bronze, gold, on=['NOC', 'Country'],
                          suffixes=['_bronze', '_gold'], how='outer')
outer_join_merge.head(5)

Unnamed: 0,NOC,Country,Total_bronze,Total_gold
0,USA,United States,1052.0,2088.0
1,URS,Soviet Union,584.0,838.0
2,GBR,United Kingdom,505.0,498.0
3,FRA,France,475.0,378.0
4,GER,Germany,454.0,407.0
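The first five rows of the left, right and outer merges above look identical because they all belong to countries present in both tables; the join types differ only for countries missing from one side. A sketch with invented rows ('NED' exists only in the left table, 'ITA' only in the right), using indicator=True to show where each row came from:

```python
import pandas as pd

# Hypothetical tables with one unmatched country on each side
bronze = pd.DataFrame({'NOC': ['USA', 'FRA', 'NED'], 'Total': [1052, 475, 100]})
gold = pd.DataFrame({'NOC': ['USA', 'FRA', 'ITA'], 'Total': [2088, 378, 460]})

row_counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(bronze, gold, on='NOC',
                      suffixes=('_bronze', '_gold'),
                      how=how, indicator=True)
    # The '_merge' column reads 'both', 'left_only' or 'right_only'
    row_counts[how] = len(merged)
    print(how, merged['_merge'].tolist())

print(row_counts)
```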


In [28]:
#unemployment dataframe
unemployment = pd.read_csv('unemployment1.csv', index_col=0)
unemployment.head(5)


Unnamed: 0_level_0,unemployment,participants
zip code,Unnamed: 1_level_1,Unnamed: 2_level_1
46241,0.06,243543
343543,0.09,646464
34343,0.17,646444
49463,0.1,333666


In [33]:
population = pd.read_csv('pa_zipcode_population.csv', index_col=0)
population.head(5)


Unnamed: 0_level_0,census
zipcode,Unnamed: 1_level_1
16855,282
15681,5678
18657,11222
46241,5443
15535,227


In [31]:
#using .join() (how='left' is the default)
population.join(unemployment)

Unnamed: 0_level_0,census,unemployment,participants
zipcode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
16855,282,,
15681,5678,,
18657,11222,,
46241,5443,0.06,243543.0
15535,227,,


In [32]:
#using .join(how='right')
population.join(unemployment, how='right')

Unnamed: 0_level_0,census,unemployment,participants
zip code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
46241,5443.0,0.06,243543
343543,,0.09,646464
34343,,0.17,646444
49463,,0.1,333666


In [34]:
#using .join(how='inner')
population.join(unemployment, how='inner')

Unnamed: 0,census,unemployment,participants
46241,5443,0.06,243543


In [35]:
#using .join(how='outer')
population.join(unemployment, how='outer')

Unnamed: 0,census,unemployment,participants
15535,227.0,,
15681,5678.0,,
16855,282.0,,
18657,11222.0,,
34343,,0.17,646444.0
46241,5443.0,0.06,243543.0
49463,,0.1,333666.0
343543,,0.09,646464.0
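.join() aligns on the row index of both frames, and the same result can be obtained from pd.merge() with left_index=True and right_index=True. A sketch on two-row versions of the tables:

```python
import pandas as pd

population = pd.DataFrame({'census': [282, 5443]},
                          index=pd.Index([16855, 46241], name='zipcode'))
unemployment = pd.DataFrame({'unemployment': [0.17, 0.06]},
                            index=pd.Index([34343, 46241], name='zip code'))

left = population.join(unemployment)                 # default how='left'
inner = population.join(unemployment, how='inner')   # labels in both
outer = population.join(unemployment, how='outer')   # union of labels

# Equivalent left join written with pd.merge on the indexes
left_via_merge = pd.merge(population, unemployment,
                          left_index=True, right_index=True, how='left')

print(len(left), len(inner), len(outer))
```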


### Ordered merges


In [36]:
software = pd.read_csv('feb_sales_Software.csv',parse_dates=['Date']).sort_values('Date')
hardware = pd.read_csv('feb_sales_Hardware.csv',parse_dates=['Date']).sort_values('Date')
software.head(3)

Unnamed: 0,Date,Company,Product,Units
2,2015-02-02 08:33:00,Hooli,Software,3
0,2015-02-16 12:09:00,Hooli,Software,10
8,2015-02-21 05:01:00,Mediacore,Software,3


In [37]:
hardware.head(3)

Unnamed: 0,Date,Company,Product,Units
3,2015-02-02 20:54:00,Mediacore,Hardware,9
2,2015-02-19 10:59:00,Mediacore,Hardware,16
4,2015-02-21 20:41:00,Hooli,Hardware,3


In [38]:
#using merge
pd.merge(hardware,software)

Unnamed: 0,Date,Company,Product,Units


In [39]:
#using merge(how='outer')
pd.merge(hardware,software, how='outer')

Unnamed: 0,Date,Company,Product,Units
0,2015-02-02 20:54:00,Mediacore,Hardware,9
1,2015-02-19 10:59:00,Mediacore,Hardware,16
2,2015-02-21 20:41:00,Hooli,Hardware,3
3,2015-04-02 21:52:00,Acme Coporation,Hardware,14
4,2015-07-02 22:58:00,Acme Coporation,Hardware,1
5,2015-02-02 08:33:00,Hooli,Software,3
6,2015-02-16 12:09:00,Hooli,Software,10
7,2015-02-21 05:01:00,Mediacore,Software,3
8,2015-03-02 14:14:00,Initech,Software,13
9,2015-04-02 15:36:00,Streeplex,Software,13


In [41]:
#using merge(how='outer'), sort!
pd.merge(hardware,software, how='outer').sort_values('Date')

Unnamed: 0,Date,Company,Product,Units
5,2015-02-02 08:33:00,Hooli,Software,3
0,2015-02-02 20:54:00,Mediacore,Hardware,9
6,2015-02-16 12:09:00,Hooli,Software,10
1,2015-02-19 10:59:00,Mediacore,Hardware,16
7,2015-02-21 05:01:00,Mediacore,Software,3
2,2015-02-21 20:41:00,Hooli,Hardware,3
8,2015-03-02 14:14:00,Initech,Software,13
9,2015-04-02 15:36:00,Streeplex,Software,13
3,2015-04-02 21:52:00,Acme Coporation,Hardware,14
10,2015-05-02 01:53:00,Acme Coporation,Software,19


In [42]:
#merge ordered
pd.merge_ordered(hardware,software)

Unnamed: 0,Date,Company,Product,Units
0,2015-02-02 08:33:00,Hooli,Software,3
1,2015-02-02 20:54:00,Mediacore,Hardware,9
2,2015-02-16 12:09:00,Hooli,Software,10
3,2015-02-19 10:59:00,Mediacore,Hardware,16
4,2015-02-21 05:01:00,Mediacore,Software,3
5,2015-02-21 20:41:00,Hooli,Hardware,3
6,2015-03-02 14:14:00,Initech,Software,13
7,2015-04-02 15:36:00,Streeplex,Software,13
8,2015-04-02 21:52:00,Acme Coporation,Hardware,14
9,2015-05-02 01:53:00,Acme Coporation,Software,19


In [43]:
#using on and suffixes
pd.merge_ordered(hardware,software, on=['Date','Company'], suffixes=['_hardware','_software']).head()

Unnamed: 0,Date,Company,Product_hardware,Units_hardware,Product_software,Units_software
0,2015-02-02 08:33:00,Hooli,,,Software,3.0
1,2015-02-02 20:54:00,Mediacore,Hardware,9.0,,
2,2015-02-16 12:09:00,Hooli,,,Software,10.0
3,2015-02-19 10:59:00,Mediacore,Hardware,16.0,,
4,2015-02-21 05:01:00,Mediacore,,,Software,3.0
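pd.merge_ordered() defaults to an outer join and returns the rows sorted by the join keys, which is why it matches the outer merge followed by sort_values('Date'). A sketch comparing the two on a few rows copied from the tables above:

```python
import pandas as pd

hardware = pd.DataFrame({
    'Date': pd.to_datetime(['2015-02-02 20:54', '2015-02-19 10:59']),
    'Company': ['Mediacore', 'Mediacore'],
    'Units': [9, 16]})
software = pd.DataFrame({
    'Date': pd.to_datetime(['2015-02-02 08:33', '2015-02-16 12:09']),
    'Company': ['Hooli', 'Hooli'],
    'Units': [3, 10]})

# Outer join with the rows ordered by the keys
ordered = pd.merge_ordered(hardware, software)

# The manual route: outer merge, sort, renumber the rows
manual = (pd.merge(hardware, software, how='outer')
            .sort_values('Date')
            .reset_index(drop=True))

print(ordered)
```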


In [46]:
#stocks data
stocks = pd.read_csv('stocks.csv')
stocks.head(3)

Unnamed: 0,DATE,AAPL,AMGN,AMZN,CPRT,EL,GS,ILMN,MA,PAA,RIO,TEF,UPS
0,04-01-10,30.57,57.72,133.9,4.55,24.27,173.08,30.55,25.68,27.0,56.03,28.55,58.18
1,05-01-10,30.63,57.22,134.69,4.55,24.18,176.14,30.35,25.61,27.3,56.9,28.53,58.28
2,06-01-10,30.14,56.79,132.25,4.53,24.25,174.26,32.22,25.56,27.29,58.64,28.23,57.85


In [47]:
gdp = pd.read_csv('GDP_USA.csv')
gdp.head(3)

Unnamed: 0,DATE,VALUE
0,1947-01-01,243.1
1,1947-04-01,246.3
2,1947-07-01,250.1


In [50]:
#ordered merge
merge_ordered=pd.merge_ordered(stocks, gdp, on='DATE')
merge_ordered.tail(3)

Unnamed: 0,DATE,AAPL,AMGN,AMZN,CPRT,EL,GS,ILMN,MA,PAA,RIO,TEF,UPS,VALUE
2037,31-12-13,80.15,114.08,398.79,9.16,75.32,177.26,110.59,83.55,51.77,56.43,16.34,105.08,
2038,31-12-14,110.38,159.29,310.35,9.12,76.2,193.83,184.58,86.16,51.32,46.06,14.21,111.17,
2039,31-12-15,105.26,162.33,675.89,9.5,88.06,180.23,191.94,97.36,23.1,29.12,11.06,96.23,


In [51]:
#ordered merge with ffill
merge_ordered_ff=pd.merge_ordered(stocks, gdp, on='DATE', fill_method='ffill')
merge_ordered_ff.tail(3)

Unnamed: 0,DATE,AAPL,AMGN,AMZN,CPRT,EL,GS,ILMN,MA,PAA,RIO,TEF,UPS,VALUE
2037,31-12-13,80.15,114.08,398.79,9.16,75.32,177.26,110.59,83.55,51.77,56.43,16.34,105.08,18436.5
2038,31-12-14,110.38,159.29,310.35,9.12,76.2,193.83,184.58,86.16,51.32,46.06,14.21,111.17,18436.5
2039,31-12-15,105.26,162.33,675.89,9.5,88.06,180.23,191.94,97.36,23.1,29.12,11.06,96.23,18436.5
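In the cells above both DATE columns are read as plain strings in different formats ('04-01-10' versus '1947-01-01'), so the ordering is alphabetical rather than chronological; that may be why the same GDP value is carried into every row of the tail. A sketch of fill_method='ffill' on invented numbers, assuming the dates are parsed to datetimes first:

```python
import pandas as pd

# Hypothetical daily-style prices and quarterly GDP values
stocks = pd.DataFrame({
    'DATE': pd.to_datetime(['2015-01-02', '2015-02-02', '2015-04-02']),
    'AAPL': [109.3, 118.6, 125.3]})
gdp = pd.DataFrame({
    'DATE': pd.to_datetime(['2015-01-01', '2015-04-01']),
    'VALUE': [17783.6, 17998.3]})

# Without filling, GDP is NaN on every date that has only a stock price
plain = pd.merge_ordered(stocks, gdp, on='DATE')

# Forward fill carries the latest known value down to later rows
filled = pd.merge_ordered(stocks, gdp, on='DATE', fill_method='ffill')

print(filled)
```

Forward filling applies to every column, so the stock price is also carried onto the GDP-only date 2015-04-01.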


## MERGING EXERCISES

In [None]:
'''
Merging on a specific column
This exercise follows on the last one with the DataFrames revenue and managers for your 
company. You expect your company to grow and, eventually, to operate in cities with the 
same name in different states. As such, you decide that every branch should have a numerical 
branch identifier. Thus, you add a branch_id column to both DataFrames. Moreover, 
new cities have been added to both the revenue and managers DataFrames as well. pandas
has been imported as pd and both DataFrames are available in your namespace.
At present, there should be a 1-to-1 relationship between the city and branch_id fields.
In that case, the result of a merge on the city columns ought to give you the same output
as a merge on the branch_id columns. Do they? Can you spot an ambiguity in one of the DataFrames?

Using pd.merge(), merge the DataFrames revenue and managers on the 'city' column of each. 
Store the result as merge_by_city.
Print the DataFrame merge_by_city. This has been done for you.
Merge the DataFrames revenue and managers on the 'branch_id' column of each. Store
the result as merge_by_id.
Print the DataFrame merge_by_id. This has been done for you, so hit 'Submit Answer'
to see the result!
'''
# Merge revenue with managers on 'city': merge_by_city
merge_by_city = pd.merge(revenue, managers, on='city')

# Print merge_by_city
print(merge_by_city)

# Merge revenue with managers on 'branch_id': merge_by_id
merge_by_id = pd.merge(revenue, managers, on='branch_id')

# Print merge_by_id
print(merge_by_id)
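
To make the ambiguity easier to spot, here is a small sketch that compares the two merges. The frames `revenue_demo` and `managers_demo` are hypothetical stand-ins for the pre-loaded `revenue` and `managers`; all values are invented.

```python
import pandas as pd

# Hypothetical toy data: values are made up for illustration only.
revenue_demo = pd.DataFrame({
    'city': ['Austin', 'Denver', 'Springfield', 'Mendocino'],
    'branch_id': [10, 20, 30, 47],
    'revenue': [100, 83, 4, 200],
})
managers_demo = pd.DataFrame({
    'city': ['Austin', 'Denver', 'Mendocino', 'Springfield'],
    'branch_id': [10, 20, 47, 31],
    'manager': ['Charles', 'Joel', 'Brett', 'Sally'],
})

by_city = pd.merge(revenue_demo, managers_demo, on='city')
by_id = pd.merge(revenue_demo, managers_demo, on='branch_id')

# After merging on 'city', the non-key 'branch_id' columns get _x/_y suffixes.
# Rows where they disagree point at the ambiguous branch.
mismatch = by_city[by_city['branch_id_x'] != by_city['branch_id_y']]
print(mismatch)
print(len(by_city), len(by_id))
```

In this toy data the merge on `city` keeps a row that the merge on `branch_id` drops, because the two tables disagree on the id for one city.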

In [None]:
'''
Merging on columns with non-matching labels
You continue working with the revenue & managers DataFrames from before. This time, 
someone has changed the field name 'city' to 'branch' in the managers table. Now, when
you attempt to merge DataFrames, an exception is thrown:
>>> pd.merge(revenue, managers, on='city')
Traceback (most recent call last):
    ... <text deleted> ...
    pd.merge(revenue, managers, on='city')
    ... <text deleted> ...
KeyError: 'city'
Given this, it will take a bit more work for you to join or merge on the city/branch name.
You have to specify the left_on and right_on parameters in the call to pd.merge().
As before, pandas has been pre-imported as pd and the revenue and managers DataFrames are
in your namespace. They have been printed in the IPython Shell so you can examine the 
columns prior to merging.
Are you able to merge better than in the last exercise? How should the rows with Springfield
be handled?

Merge the DataFrames revenue and managers into a single DataFrame called combined using the
'city' and 'branch' columns from the appropriate DataFrames.
In your call to pd.merge(), you will have to specify the parameters left_on and right_on 
appropriately.
Print the new DataFrame combined.
'''
# Merge revenue & managers on 'city' & 'branch': combined
combined = pd.merge(revenue, managers, left_on='city', right_on='branch')

# Print combined
print(combined)
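
One side effect of `left_on`/`right_on` is that both key columns are kept in the result. A minimal sketch with invented data (`revenue_demo` and `managers_demo` are hypothetical) shows this and one way to drop the redundant column:

```python
import pandas as pd

# Hypothetical toy data with differently named key columns.
revenue_demo = pd.DataFrame({'city': ['Austin', 'Denver'], 'revenue': [100, 83]})
managers_demo = pd.DataFrame({'branch': ['Austin', 'Denver'], 'manager': ['Charles', 'Joel']})

combined_demo = pd.merge(revenue_demo, managers_demo, left_on='city', right_on='branch')
print(list(combined_demo.columns))

# 'branch' duplicates 'city' after an inner merge, so it can be dropped.
tidy = combined_demo.drop(columns='branch')
print(tidy)
```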

In [None]:
'''
Merging on multiple columns
Another strategy to disambiguate cities with identical names is to add information on the 
states in which the cities are located. To this end, you add a column called state to both 
DataFrames from the preceding exercises. Again, pandas has been pre-imported as pd and the 
revenue and managers DataFrames are in your namespace.
Your goal in this exercise is to use pd.merge() to merge DataFrames using multiple columns
(using 'branch_id', 'city', and 'state' in this case).
Are you able to match all your company's branches correctly?
INSTRUCTIONS
Create a column called 'state' in the DataFrame revenue, consisting of the list 
['TX','CO','IL','CA'].
Create a column called 'state' in the DataFrame managers, consisting of the list
['TX','CO','CA','MO'].
Merge the DataFrames revenue and managers using three columns :'branch_id', 'city', 
and 'state'. Pass them in as a list to the on parameter of pd.merge().
'''
# Add 'state' column to revenue: revenue['state']
revenue['state'] = ['TX','CO','IL','CA']

# Add 'state' column to managers: managers['state']
managers['state'] = ['TX','CO','CA','MO']

# Merge revenue & managers on 'branch_id', 'city', & 'state': combined
combined = pd.merge(revenue, managers, on=['branch_id', 'city', 'state'])

# Print combined
print(combined)
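
To check whether every branch was matched, one option is an outer merge with `indicator=True`, which adds a `_merge` column saying where each row came from. This is a sketch with invented values; `revenue_demo` and `managers_demo` are hypothetical stand-ins for the exercise DataFrames, using the same state lists as above.

```python
import pandas as pd

# Hypothetical toy data; only the 'state' lists follow the exercise text.
revenue_demo = pd.DataFrame({
    'branch_id': [10, 20, 30, 47],
    'city': ['Austin', 'Denver', 'Springfield', 'Mendocino'],
    'state': ['TX', 'CO', 'IL', 'CA'],
    'revenue': [100, 83, 4, 200],
})
managers_demo = pd.DataFrame({
    'branch_id': [10, 20, 47, 31],
    'city': ['Austin', 'Denver', 'Mendocino', 'Springfield'],
    'state': ['TX', 'CO', 'CA', 'MO'],
    'manager': ['Charles', 'Joel', 'Brett', 'Sally'],
})

# An outer join keeps unmatched rows; _merge is 'both', 'left_only' or 'right_only'.
check = pd.merge(revenue_demo, managers_demo,
                 on=['branch_id', 'city', 'state'],
                 how='outer', indicator=True)

unmatched = check[check['_merge'] != 'both']
print(unmatched)
```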

In [None]:
'''
Left & right merging on multiple columns
You now have, in addition to the revenue and managers DataFrames from prior exercises,
a DataFrame sales that summarizes units sold from specific branches (identified by city 
and state but not branch_id).
Once again, the managers DataFrame uses the label branch in place of city as in the other 
two DataFrames. Your task here is to employ left and right merges to preserve data and 
identify where data is missing.
By merging revenue and sales with a right merge, you can identify the missing revenue values.
Here, you don't need to specify left_on or right_on because the columns to merge on have 
matching labels.
By merging sales and managers with a left merge, you can identify the missing manager. Here, 
the columns to merge on have conflicting labels, so you must specify left_on and right_on. 
In both cases, you're looking to figure out how to connect the fields in rows containing 
Springfield.
pandas has been imported as pd and the three DataFrames revenue, managers, and sales have
been pre-loaded. They have been printed for you to explore in the IPython Shell.
INSTRUCTIONS
Execute a right merge using pd.merge() with revenue and sales to yield a new DataFrame
revenue_and_sales.
Use how='right' and on=['city', 'state'].
Print the new DataFrame revenue_and_sales. This has been done for you.
Execute a left merge with sales and managers to yield a new DataFrame sales_and_managers.
Use how='left', left_on=['city', 'state'], and right_on=['branch', 'state'].
Print the new DataFrame sales_and_managers. This has been done for you, so hit 
'Submit Answer' to see the result!
'''
# Merge revenue and sales: revenue_and_sales
revenue_and_sales = pd.merge(revenue, sales, how='right', on=['city', 'state'])

# Print revenue_and_sales
print(revenue_and_sales)

# Merge sales and managers: sales_and_managers
sales_and_managers = pd.merge(sales, managers, how='left', left_on=['city', 'state'], right_on=['branch', 'state'])

# Print sales_and_managers
print(sales_and_managers)
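
A right merge keeps every row of the right DataFrame, so missing values on the left side show up as NaN. The sketch below uses invented data (`revenue_demo` and `sales_demo` are hypothetical) and filters on `isna()` to list the gaps:

```python
import pandas as pd

# Hypothetical toy data: sales has one branch that revenue does not know about.
revenue_demo = pd.DataFrame({
    'city': ['Austin', 'Denver', 'Springfield'],
    'state': ['TX', 'CO', 'IL'],
    'revenue': [100, 83, 4],
})
sales_demo = pd.DataFrame({
    'city': ['Austin', 'Denver', 'Springfield', 'Springfield'],
    'state': ['TX', 'CO', 'IL', 'MO'],
    'units': [2, 4, 1, 5],
})

# how='right' preserves all rows of sales_demo.
right_demo = pd.merge(revenue_demo, sales_demo, how='right', on=['city', 'state'])

# Rows with no revenue are the branches missing from revenue_demo.
missing_revenue = right_demo[right_demo['revenue'].isna()]
print(missing_revenue)
```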

In [None]:
'''
Merging DataFrames with outer join
This exercise picks up where the previous one left off. The DataFrames revenue, managers,
and sales are pre-loaded into your namespace (and, of course, pandas is imported as pd). 
Moreover, the merged DataFrames revenue_and_sales and sales_and_managers have been
pre-computed exactly as you did in the previous exercise.
The merged DataFrames contain enough information to construct a DataFrame with 5 rows with 
all known information correctly aligned and each branch listed only once. You will try to 
merge the merged DataFrames on all matching keys (which computes an inner join by default).
You can compare the result to an outer join and also to an outer join with restricted subset
of columns as keys.
INSTRUCTIONS
Merge sales_and_managers with revenue_and_sales. Store the result as merge_default.
Print merge_default. This has been done for you.
Merge sales_and_managers with revenue_and_sales using how='outer'. Store the result as 
merge_outer.
Print merge_outer. This has been done for you.
Merge sales_and_managers with revenue_and_sales only on ['city','state'] using an outer 
join. Store the result as merge_outer_on and hit 'Submit Answer' to see what the merged
DataFrames look like!
'''
# Perform the first merge: merge_default
merge_default = pd.merge(sales_and_managers, revenue_and_sales)

# Print merge_default
print(merge_default)

# Perform the second merge: merge_outer
merge_outer = pd.merge(sales_and_managers, revenue_and_sales, how='outer')

# Print merge_outer
print(merge_outer)

# Perform the third merge: merge_outer_on
merge_outer_on = pd.merge(sales_and_managers, revenue_and_sales, on=['city', 'state'], how='outer')

# Print merge_outer_on
print(merge_outer_on)
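
When `on` is omitted, `pd.merge()` joins on every column name the two DataFrames share. Restricting `on` to a subset turns the remaining shared columns into suffixed pairs. A small sketch with invented data (`left_demo` and `right_demo` are hypothetical):

```python
import pandas as pd

# Hypothetical toy data sharing three column names: city, state, units.
left_demo = pd.DataFrame({
    'city': ['Austin', 'Denver'], 'state': ['TX', 'CO'],
    'units': [2, 4], 'manager': ['Charles', 'Joel'],
})
right_demo = pd.DataFrame({
    'city': ['Austin', 'Denver'], 'state': ['TX', 'CO'],
    'units': [2, 4], 'revenue': [100, 83],
})

# Columns present in both frames are the default merge keys.
shared = [c for c in left_demo.columns if c in right_demo.columns]
print(shared)

default_demo = pd.merge(left_demo, right_demo)  # keys: all shared columns
on_subset = pd.merge(left_demo, right_demo, on=['city', 'state'], how='outer')
print(list(on_subset.columns))  # 'units' appears twice, as units_x and units_y
```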

In [None]:
'''
Using merge_ordered()
This exercise uses pre-loaded DataFrames austin and houston that contain weather data
from the cities Austin and Houston respectively. They have been printed in the IPython Shell
for you to examine.
Weather conditions were recorded on separate days and you need to merge these two DataFrames
together such that the dates are ordered. To do this, you'll use pd.merge_ordered(). 
After you're done, note the order of the rows before and after merging.
INSTRUCTIONS
Perform an ordered merge on austin and houston using pd.merge_ordered(). Store the result
as tx_weather.
Print tx_weather. You should notice that the rows are sorted by the date but it is not 
possible to tell which observation came from which city.
Perform another ordered merge on austin and houston.
This time, specify the keyword arguments on='date' and suffixes=['_aus','_hus'] so that
the rows can be distinguished. Store the result as tx_weather_suff.
Print tx_weather_suff to examine its contents. This has been done for you.
Perform a third ordered merge on austin and houston.
This time, in addition to the on and suffixes parameters, specify the keyword argument 
fill_method='ffill' to use forward-filling to replace NaN entries with the most recent
non-null entry, and hit 'Submit Answer' to examine the contents of the merged DataFrames!
'''
# Perform the first ordered merge: tx_weather
tx_weather = pd.merge_ordered(austin, houston)

# Print tx_weather
print(tx_weather)

# Perform the second ordered merge: tx_weather_suff
tx_weather_suff = pd.merge_ordered(austin, houston, on='date', suffixes=['_aus', '_hus'])

# Print tx_weather_suff
print(tx_weather_suff)

# Perform the third ordered merge: tx_weather_ffill
tx_weather_ffill = pd.merge_ordered(austin, houston, on='date', suffixes=['_aus', '_hus'], fill_method='ffill')

# Print tx_weather_ffill
print(tx_weather_ffill)
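
For readers without the pre-loaded data, the same idea can be tried on a few invented rows. `austin_demo` and `houston_demo` are hypothetical, and the column name `ratings` is an assumption.

```python
import pandas as pd

# Hypothetical weather observations, deliberately not in date order.
austin_demo = pd.DataFrame({
    'date': pd.to_datetime(['2016-01-01', '2016-02-08', '2016-01-17']),
    'ratings': ['Cloudy', 'Cloudy', 'Sunny'],
})
houston_demo = pd.DataFrame({
    'date': pd.to_datetime(['2016-01-04', '2016-01-01', '2016-03-01']),
    'ratings': ['Rainy', 'Cloudy', 'Sunny'],
})

# merge_ordered performs an outer join and sorts the result by the key.
tx_demo = pd.merge_ordered(austin_demo, houston_demo, on='date',
                           suffixes=('_aus', '_hus'))
print(tx_demo)
```

Each date appears once, and the suffixes show which city a rating belongs to. Dates recorded in only one city have NaN in the other column.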

In [None]:
'''
Using merge_asof()
Similar to pd.merge_ordered(), the pd.merge_asof() function will also merge values in order
using the on column, but for each row in the left DataFrame, only the last row from the right
DataFrame whose 'on' column value is less than or equal to the left value will be kept
(this is the default, direction='backward').
This function can be used to align disparate datetime frequencies without having to first 
resample.
Here, you'll merge monthly oil prices (US dollars) into a full automobile fuel efficiency
dataset. The oil and automobile DataFrames have been pre-loaded as oil and auto. The first
5 rows of each have been printed in the IPython Shell for you to explore.
These datasets will align such that the first price of the year will be broadcast into
the rows of the automobiles DataFrame. This is considered correct since by the start of 
any given year, most automobiles for that year will have already been manufactured.
You'll then inspect the merged DataFrame, resample by year and compute the mean 'Price'
and 'mpg'. You should be able to see a trend in these two columns, that you can confirm 
by computing the Pearson correlation between resampled 'Price' and 'mpg'.

Merge auto and oil using pd.merge_asof() with left_on='yr' and right_on='Date'. 
Store the result as merged.
Print the tail of merged. This has been done for you.
Resample merged using 'A' (annual frequency), and on='Date'. Select [['mpg','Price']] and
aggregate the mean. Store the result as yearly.
Hit Submit Answer to examine the contents of yearly and yearly.corr(), which shows the
Pearson correlation between the resampled 'Price' and 'mpg'.
'''
# Merge auto and oil: merged
merged = pd.merge_asof(auto, oil, left_on='yr', right_on='Date')

# Print the tail of merged
print(merged.tail())

# Resample merged: yearly
# ('A' is the annual alias used in this exercise; pandas 2.2+ prefers 'YE')
yearly = merged.resample('A',on='Date')[['mpg','Price']].mean()

# Print yearly
print(yearly)

# Print yearly.corr()
print(yearly.corr())
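
The backward matching of `pd.merge_asof()` is easier to see on a few rows. This is a minimal sketch with invented values; `auto_demo` and `oil_demo` are hypothetical stand-ins for `auto` and `oil`.

```python
import pandas as pd

# Hypothetical data; both key columns must be sorted and of the same dtype.
auto_demo = pd.DataFrame({
    'yr': pd.to_datetime(['1970-01-01', '1971-01-01', '1972-01-01']),
    'mpg': [18.0, 25.0, 21.0],
})
oil_demo = pd.DataFrame({
    'Date': pd.to_datetime(['1970-01-01', '1970-02-01', '1972-06-01']),
    'Price': [3.35, 3.40, 9.99],
})

# Default direction='backward': take the last oil row with Date <= yr.
merged_demo = pd.merge_asof(auto_demo, oil_demo, left_on='yr', right_on='Date')
print(merged_demo)
```

The 1970 row matches exactly, while the 1971 and 1972 rows both fall back to the February 1970 price, because the June 1972 price lies after them.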

### Case study: Summer Olympics

In [None]:
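#reminder loading and merging files
from glob import glob
#pd.read_csv(filepath) (and many options)
#looping over files e.g.
dataframes = [pd.read_csv(f) for f in glob('*.csv')]

#concatenating and appending (df1, df2: any two DataFrames)
pd.concat([df1, df2], axis=0)
df1.append(df2)  # removed in pandas 2.0; use pd.concat([df1, df2]) there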
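
As a quick refresher on the concatenation options, here is a sketch on two tiny invented frames (`df1_demo` and `df2_demo` are hypothetical). It shows `ignore_index` for a fresh row index and `keys` for a MultiIndex that records the source of each row.

```python
import pandas as pd

# Hypothetical monthly frames.
df1_demo = pd.DataFrame({'units': [1, 2]})
df2_demo = pd.DataFrame({'units': [3]})

# Stack vertically and renumber the rows 0..n-1.
stacked = pd.concat([df1_demo, df2_demo], axis=0, ignore_index=True)

# Stack vertically and label each block, giving a two-level row index.
keyed = pd.concat([df1_demo, df2_demo], keys=['jan', 'feb'])

print(stacked)
print(keyed.loc['feb'])
```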

#### Constructing a pivot table
Apply DataFrame pivot_table() method
index: column to use as index of pivot table
values: columns to aggregate
aggfunc: function to apply for aggregation
columns: categories as columns of pivot table
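
The four arguments above can be tried on a small invented table. `medals_demo` and its column names (`NOC`, `Edition`, `Athlete`) are assumptions for illustration, not the actual case-study data.

```python
import pandas as pd

# Hypothetical medal records: one row per medal won.
medals_demo = pd.DataFrame({
    'NOC': ['USA', 'USA', 'GBR', 'GBR', 'USA'],
    'Edition': [1896, 1900, 1896, 1900, 1900],
    'Athlete': ['a', 'b', 'c', 'd', 'e'],
})

# Rows: Edition, columns: NOC, cells: number of Athlete entries.
counts = medals_demo.pivot_table(index='Edition', values='Athlete',
                                 columns='NOC', aggfunc='count')
print(counts)
```

Each cell counts the medal rows for one country in one edition.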