# Making Final Data using Python
By Shuhei Kitamura

- We already have cleaned data. The next step is to make the final data for analysis.
- In this exercise, we will put eight files together. How?
    1. Append US Senate election data
    2. Append daily temperature data
    3. Merge them
- Our goal is to build a final dataset with a panel structure.

### Outline<a id='top'></a>
1. [Importing Data](#sec1)
2. [Combining Data](#sec2)
3. [Reshaping Data](#sec3)  
4. [Making Variables](#sec4)
5. [Saving Data](#sec5)

In [None]:
# import packages and modules
import os
import pandas as pd
import numpy as np

In [None]:
# set the display options (not necessary)
pd.options.display.max_rows = 200 # set the max number of rows to display 
pd.options.display.max_columns = 100 # set the max number of columns to display 

In [None]:
# set the working directory (if necessary)
# os.chdir('...') # replace '...' with the location of the working directory

## 1. Importing Data<a id='sec1'></a>
- Import data as usual. Recall that all files are saved in csv format.

[back to top](#top)

In [None]:
elec_data = {}
temp_data = {}
for year in range(2008,2016,2):
    elec_data[str(year)] = pd.read_csv('us_senate_'+str(year)+'.csv')
    temp_data[str(year)] = pd.read_csv('us_daily_temp_'+str(year)+'.csv')    

## 2. Combining Data<a id='sec2'></a>
### Appending
- Recall that we have used `pd.concat` to combine Pandas objects.
    - `pd.concat` can be used for both appending (`axis=0`) and merging (`axis=1`) data.
    - `pd.merge` is also available, and we will use it later. (`object.append` was removed in pandas 2.0; use `pd.concat` instead.)
- Run the following code to append `df1` and `df2` using `pd.concat`.

[back to top](#top)

In [None]:
df1 = pd.DataFrame(np.random.rand(3,3), columns=['var1', 'var2', 'var3'], index=['a', 'b', 'c'])
df2 = pd.DataFrame(np.random.rand(3,3), columns=['var1', 'var2', 'var3'], index=['a', 'b', 'c'])
print(df1); print(df2)
print(pd.concat([df1, df2], axis=0)); print(pd.concat([df1, df2], axis=0, ignore_index=True))
print(pd.concat([df1, df2], axis=1))

- What happens if some columns (rows) are missing in one of the datasets when appending (merging) them?
- To keep them, use the `join='outer'` option; to drop them, use the `join='inner'` option.
    - If you get a warning message by adding the `'outer'` option, add the `sort=False` option.

In [None]:
df1 = pd.DataFrame(np.random.rand(2,2), columns=['var1', 'var2'], index=['a', 'b'])
df2 = pd.DataFrame(np.random.rand(3,3), columns=['var1', 'var2', 'var3'], index=['a', 'b', 'c'])
print(df1); print(df2)
print(pd.concat([df1, df2], axis=0, ignore_index=True, join='inner'))
print(pd.concat([df1, df2], axis=0, ignore_index=True, join='outer', sort=False))
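
- As a deterministic check of the two `join` options, here is a minimal sketch with hand-typed values instead of random numbers.
    - The names `left`, `right`, `inner`, and `outer` are illustrative and not part of the exercise data.

```python
import pandas as pd

# Two toy frames: `right` has an extra column 'var3'
left = pd.DataFrame({'var1': [1, 2], 'var2': [3, 4]})
right = pd.DataFrame({'var1': [5, 6, 7], 'var2': [8, 9, 10], 'var3': [11, 12, 13]})

# Append with each join option
inner = pd.concat([left, right], axis=0, ignore_index=True, join='inner')
outer = pd.concat([left, right], axis=0, ignore_index=True, join='outer', sort=False)

print(list(inner.columns))         # only the columns shared by both frames
print(list(outer.columns))         # every column from either frame
print(outer['var3'].isna().sum())  # rows from `left` have no 'var3' value
```

- With `join='outer'`, the rows that come from `left` get a missing value in `'var3'`.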

- Let's combine all years for election data and temperature data, respectively.
- Wait... but temperature data are already very long (> 250,000 observations).
- Let's reduce the sizes of the datasets before appending them.
    - Why didn't we do it when we cleaned data?

### Group aggregation
- Our goal is to keep a single observation for each state and year. 
- How? There are several strategies.
    - Compute mean/max/min/std, etc. (e.g., mean temperature for California in 2008, etc.)
    - Keep some of the observations
    - Reshape data
- I suggest the following procedure:
    1. Keep the Election Day temperature
    2. Take the mean of `'arithmetic_mean'` (daily average temperature) and the max and mean of `'1st_max_value'` (daily max temperature).
- Do they make sense to you?

- Let's keep the Election Day temperature for each year.
    - Election Day: November 4, 2008; November 2, 2010; November 6, 2012; November 4, 2014

In [None]:
temp_data['2008'] = temp_data['2008'].loc[temp_data['2008']['date_local'] == '2008-11-04', ]
temp_data['2010'] = temp_data['2010'].loc[temp_data['2010']['date_local'] == '2010-11-02', ]
temp_data['2012'] = temp_data['2012'].loc[temp_data['2012']['date_local'] == '2012-11-06', ]
temp_data['2014'] = temp_data['2014'].loc[temp_data['2014']['date_local'] == '2014-11-04', ]
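
- The same filter can also be written as a loop over a dictionary of dates. The following is only a sketch on toy data.
    - `election_days`, `toy_temp`, and `toy_election_day` are hypothetical names, and only two years are shown to keep the example short.
    - It assumes that `'date_local'` is stored as a string in `YYYY-MM-DD` format, as in the code above.

```python
import pandas as pd

# Hypothetical mapping from election year to Election Day (two years only)
election_days = {'2008': '2008-11-04', '2010': '2010-11-02'}

# Toy stand-in for the daily temperature data
toy_temp = {
    '2008': pd.DataFrame({'state_name': ['Ohio', 'Ohio', 'Utah'],
                          'date_local': ['2008-11-03', '2008-11-04', '2008-11-04'],
                          'arithmetic_mean': [50.0, 55.0, 40.0]}),
    '2010': pd.DataFrame({'state_name': ['Ohio', 'Ohio'],
                          'date_local': ['2010-11-01', '2010-11-02'],
                          'arithmetic_mean': [48.0, 47.0]}),
}

# Keep only the rows recorded on Election Day of each year
toy_election_day = {
    year: df.loc[df['date_local'] == election_days[year]].copy()
    for year, df in toy_temp.items()
}

print(toy_election_day['2008'])
print(toy_election_day['2010'])
```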

- Next, let's aggregate values. A powerful method for aggregation is `object.groupby()`.
    - The level of aggregation is specified in `.groupby()`.
    - Then, apply `object.agg(new_name=(column, func))` in which `func` is the function you want to apply for `column`, and `new_name` is the name of a new variable.
    - Finally, add `object.reset_index()` to reset the indices.

In [None]:
for year in range(2008,2016,2):
    # pass function names as strings; recent pandas versions warn when NumPy callables are given
    aggs = dict(temp_mean=('arithmetic_mean', 'mean'), temp_max_max=('1st_max_value', 'max'), temp_max_mean=('1st_max_value', 'mean'))
    temp_data[str(year)+'_agg'] = temp_data[str(year)].groupby('state_name').agg(**aggs).reset_index()
    temp_data[str(year)+'_agg']['elec_year'] = np.int64(year) # add a year variable
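
- To see what the named aggregation returns, here is a small sketch on toy data.
    - The frame `toy` and its values are made up; only the column names follow the temperature data.

```python
import pandas as pd

# Toy daily records: two for Ohio, one for Utah
toy = pd.DataFrame({
    'state_name': ['Ohio', 'Ohio', 'Utah'],
    'arithmetic_mean': [50.0, 54.0, 40.0],
    '1st_max_value': [60.0, 66.0, 48.0],
})

# One row per state; each keyword becomes a new column
toy_agg = (
    toy.groupby('state_name')
       .agg(temp_mean=('arithmetic_mean', 'mean'),
            temp_max_max=('1st_max_value', 'max'),
            temp_max_mean=('1st_max_value', 'mean'))
       .reset_index()
)
print(toy_agg)
```

- Without `.reset_index()`, `'state_name'` would be the index rather than a regular column, which is less convenient for merging later.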

- Finally, it's time to append all the data.

In [None]:
years = range(2008, 2016, 2)
elec_all = pd.concat([elec_data[str(year)] for year in years], axis=0, ignore_index=True)
temp_all = pd.concat([temp_data[str(year)+'_agg'] for year in years], axis=0, ignore_index=True)

### Merging
- Merging means that you combine data horizontally.
- To merge Pandas objects, use `pd.merge`.
    - You can choose from inner join, outer join, right join, and left join. 
    - Pass the name of the id column to `on` (e.g., `on='id'`).
    - You can also use `pd.concat` but `pd.merge` seems to be more flexible and intuitive.
- Run the following code to merge `df1` and `df2`.

In [None]:
df1 = pd.DataFrame([["tom", 9], ["jerry", 12]], columns=['name', 'educ'], index=['a', 'b'])
df2 = pd.DataFrame([["tom", 185, 70], ["jerry", 170, 62], ["spike", 165, 60]], columns=['name', 'height', 'weight'], index=['a', 'b', 'c'])
print(df1); print(df2)
print(pd.merge(df1, df2, on='name')) # inner join (intersection). name is the id
print(pd.merge(df1, df2, on='name', how='outer')) # outer join (union)
print(pd.merge(df1, df2, on='name', how='right'))  # right join (keep right data)
print(pd.merge(df1, df2, on='name', how='left')) # left join (keep left data)

- Id names can be different.

In [None]:
df1 = pd.DataFrame([["tom", 9], ["jerry", 12]], columns=['name1', 'educ'], index=['a', 'b'])
df2 = pd.DataFrame([["tom", 185, 70], ["jerry", 170, 62], ["spike", 165, 60]], columns=['name2', 'height', 'weight'], index=['a', 'b', 'c'])
print(df1); print(df2)
print(pd.merge(df1, df2, left_on='name1', right_on='name2', how='outer')) 

- You can use more than one id.

In [None]:
df1 = pd.DataFrame([["tom", 2000, 9], ["jerry", 2000, 12]], columns=['name', 'year', 'educ'], index=['a', 'b'])
df2 = pd.DataFrame([["tom", 2000, 185, 70], ["jerry", 2000, 170, 62], ["tom", 2001, 187, 75], ["jerry", 2001, 171, 63]], columns=['name', 'year', 'height', 'weight'], index=['a', 'b', 'c', 'd'])
print(df1); print(df2)
print(pd.merge(df1, df2, on=['name', 'year'], how='outer'))

- What happens if two datasets have the same column name with different values?

In [None]:
df1 = pd.DataFrame([["tom", 185], ["jerry", 170]], columns=['name', 'height'], index=['a', 'b'])
df2 = pd.DataFrame([["tom", 185, 70], ["jerry", 172, 62]], columns=['name', 'height', 'weight'], index=['a', 'b'])
print(df1); print(df2)
print(pd.merge(df1, df2, on='name'))
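
- By default, `pd.merge` keeps both versions of a shared non-id column and adds the suffixes `_x` and `_y`. The sketch below also shows the `suffixes` option.
    - `df_a`, `df_b`, and the suffix labels `_survey1` and `_survey2` are made-up names for illustration.

```python
import pandas as pd

df_a = pd.DataFrame({'name': ['tom', 'jerry'], 'height': [185, 170]})
df_b = pd.DataFrame({'name': ['tom', 'jerry'], 'height': [185, 172], 'weight': [70, 62]})

# Default suffixes for the overlapping column 'height'
default = pd.merge(df_a, df_b, on='name')
print(list(default.columns))  # ['name', 'height_x', 'height_y', 'weight']

# Custom suffixes make the origin of each column easier to read
labeled = pd.merge(df_a, df_b, on='name', suffixes=('_survey1', '_survey2'))

# List the ids whose values disagree between the two sources
mismatch = labeled.loc[labeled['height_survey1'] != labeled['height_survey2'], 'name']
print(mismatch.tolist())      # ['jerry']
```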

- Let's merge election and temperature data.
- Before doing so:
    - Check which variable is an id. (There can be more than one id.)
    - Check that the ids are unique in at least one of the datasets.
- Which variables are ids in election and temperature data?
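
- One possible way to check uniqueness is `object.duplicated()`, and `pd.merge` can enforce it with the `validate` option. The following is a sketch on toy data.
    - `toy_elec` and `toy_temp` are made-up frames. Treating state and election year as the ids is an assumption here, so check it against the real data.

```python
import pandas as pd

toy_elec = pd.DataFrame({'state_long': ['Ohio', 'Ohio', 'Utah'],
                         'elec_year': [2008, 2010, 2008]})
toy_temp = pd.DataFrame({'state_name': ['Ohio', 'Ohio', 'Utah'],
                         'elec_year': [2008, 2010, 2008],
                         'temp_mean': [52.0, 47.0, 40.0]})

# Count the rows that repeat an id combination seen earlier
n_dup_elec = toy_elec.duplicated(subset=['state_long', 'elec_year']).sum()
n_dup_temp = toy_temp.duplicated(subset=['state_name', 'elec_year']).sum()
print(n_dup_elec, n_dup_temp)

# validate='many_to_one' raises an error if the ids are not unique in the right data
merged = pd.merge(toy_elec, toy_temp,
                  left_on=['state_long', 'elec_year'],
                  right_on=['state_name', 'elec_year'],
                  how='left', validate='many_to_one')
print(merged)
```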

- Check that state names and election years look fine before merging.
- Run the following code.

In [None]:
print(elec_all['state_long'].unique()); print(elec_all['elec_year'].unique())
print(temp_all['state_name'].unique()); print(temp_all['elec_year'].unique())

- Alas, some strings in `elec_all` contain stray spaces.
- We have to remove them. Otherwise, we will not be able to merge the data properly.
- There are several ways to remove spaces. However, in this case, it's unwise to use `object.str.replace(" ","")`, which removes all spaces in a given string, including the ones inside names such as "New York".
- Let's remove the right-side spaces using `object.str.rstrip()`.
    - To remove the left-side spaces, use `object.str.lstrip()`. To remove both sides at once, use `object.str.strip()`.

In [None]:
elec_all['state_long'] = elec_all['state_long'].str.rstrip() 
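
- To compare the string methods side by side, here is a small sketch on a made-up Series `names`.
    - `str.rstrip()` trims the right side, `str.strip()` trims both sides, and `str.replace(' ', '')` deletes every space.

```python
import pandas as pd

# Toy state names: trailing spaces, a leading space, and a clean value
names = pd.Series(['New York  ', ' Ohio', 'Utah'])

right_trimmed = names.str.rstrip()                      # right side only
both_trimmed = names.str.strip()                        # both sides
no_spaces = names.str.replace(' ', '', regex=False)     # every space, also inside names

print(right_trimmed.tolist())  # ['New York', ' Ohio', 'Utah']
print(both_trimmed.tolist())   # ['New York', 'Ohio', 'Utah']
print(no_spaces.tolist())      # ['NewYork', 'Ohio', 'Utah']
```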

- What about now?

In [None]:
print(elec_all['state_long'].unique()); print(elec_all['elec_year'].unique())
print(temp_all['state_name'].unique()); print(temp_all['elec_year'].unique())

- Finally, check that each pair of corresponding ids shares the same type.

In [None]:
print(type(elec_all['state_long'][0])); print(type(temp_all['state_name'][0]))
print(type(elec_all['elec_year'][0])); print(type(temp_all['elec_year'][0]))

- Now, let's merge the two datasets.

In [None]:
data_use = pd.merge(elec_all, temp_all, left_on=['state_long', 'elec_year'], right_on=['state_name', 'elec_year'], how='outer')

- Print `data_use`, which should be in the long format. You will often use this type of data structure in panel data analysis.
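
- After an outer join, it is useful to know which rows matched. One option is `indicator=True`, which adds a `'_merge'` column. The following is a sketch on toy data.
    - `toy_elec`, `toy_temp`, and `checked` are made-up names, and the values are invented.

```python
import pandas as pd

toy_elec = pd.DataFrame({'state_long': ['Ohio', 'Utah'],
                         'elec_year': [2008, 2008],
                         'gelec_rep': [100, 200]})
toy_temp = pd.DataFrame({'state_name': ['Ohio', 'Iowa'],
                         'elec_year': [2008, 2008],
                         'temp_mean': [52.0, 45.0]})

# '_merge' takes the values 'both', 'left_only', or 'right_only'
checked = pd.merge(toy_elec, toy_temp,
                   left_on=['state_long', 'elec_year'],
                   right_on=['state_name', 'elec_year'],
                   how='outer', indicator=True)

print(checked['_merge'].value_counts())
print(checked.loc[checked['_merge'] != 'both'])  # rows without a match
```

- Rows marked `'left_only'` or `'right_only'` often point to ids that are spelled differently in the two datasets.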

## 3. Reshaping Data<a id='sec3'></a>
- If necessary, you can also reshape data. In that case, use `object.pivot`.
    - In our case, we don't need to reshape the data.
    
[back to top](#top)

In [None]:
data_use_to_reshape = data_use.loc[data_use['state_long'].notna(),]
data1 = data_use_to_reshape.pivot(index='state_long', columns='elec_year') # reshape to a wide format
data2 = data1.stack() # back to a long format
data3 = data2.unstack() # back to a wide format
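
- The reshaping steps are easier to follow on a tiny dataset. Here is a sketch with one value column.
    - `long_df`, `wide_df`, and `back_to_long` are made-up names, and the values are invented.

```python
import pandas as pd

# Long format: one row per state and year
long_df = pd.DataFrame({
    'state_long': ['Ohio', 'Ohio', 'Utah', 'Utah'],
    'elec_year': [2008, 2010, 2008, 2010],
    'temp_mean': [52.0, 47.0, 40.0, 38.0],
})

# Wide format: one row per state, one column per year
wide_df = long_df.pivot(index='state_long', columns='elec_year', values='temp_mean')
print(wide_df)

# Back to the long format: stack the year columns into rows
back_to_long = wide_df.stack().rename('temp_mean').reset_index()
print(back_to_long)
```

- Note that `object.pivot` raises an error if a combination of `index` and `columns` appears more than once.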

## 4. Making Variables<a id='sec4'></a>
- You may need more variables for analysis. For example:
    - Logarithm
    - Total, mean, min, max...
    - Share, ratio...
- Let's make 
    - Vote share of Republican and Democratic candidates
    - Natural logarithm of temperature
        
[back to top](#top)

In [None]:
# make vote shares
data_use['gelec_total'] = data_use[['gelec_dem', 'gelec_rep', 'gelec_oth']].sum(axis=1) # total vote
data_use['rep_share'] = data_use['gelec_rep'] / data_use['gelec_total'] # Republican vote share
data_use['dem_share'] = data_use['gelec_dem'] / data_use['gelec_total'] # Democratic vote share

# take natural logs
data_use['ln_temp_mean'] = np.log(data_use['temp_mean'])
data_use['ln_temp_max_max'] = np.log(data_use['temp_max_max'])
data_use['ln_temp_max_mean'] = np.log(data_use['temp_max_mean'])
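
- The logarithm is only defined for positive numbers. Whether the temperature columns contain zero or negative values depends on the unit and the data, which are not shown here, so the following is only a sketch under that assumption.
    - `temps` is a made-up Series. Masking non-positive values with `object.where` is one possible choice, not the only one.

```python
import numpy as np
import pandas as pd

# Toy temperatures including zero, a negative value, and a missing value
temps = pd.Series([52.0, 0.0, -3.5, np.nan], name='temp_mean')

# Replace non-positive values with NaN first so np.log never sees them
ln_temps = np.log(temps.where(temps > 0))
print(ln_temps)

# Number of observations lost by taking logs
n_lost = ln_temps.isna().sum() - temps.isna().sum()
print(n_lost)
```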

## 5. Saving Data<a id='sec5'></a>
        
[back to top](#top)

In [None]:
data_use.to_csv('data_use.csv', index=False)
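
- To see what `index=False` does, here is a small round-trip sketch that writes to an in-memory buffer instead of a file.
    - `toy`, `buffer`, and `reloaded` are made-up names, and the values are invented.

```python
import io
import pandas as pd

toy = pd.DataFrame({'state_long': ['Ohio', 'Utah'],
                    'elec_year': [2008, 2008],
                    'rep_share': [0.5, 0.625]})

# Write the data as csv text without the row index
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Read the text back and compare with the original
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(list(reloaded.columns))  # no extra index column appears
print(reloaded)
```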