# Analyzing Data using Python
By Shuhei Kitamura

### Outline
1. Preparing Data for Analysis
    - Importing Data
    - Combining Data
    - Reshaping Data  
    - Making Variables
    - Saving Data

In [1]:
import os
import pandas as pd
import numpy as np

In [2]:
pd.options.display.max_rows = 300 # set # of rows to display 
pd.options.display.max_columns = 100 # set # of columns to display 

In [3]:
os.chdir('...') # set the working directory

## 1. Preparing Data for Analysis
- The data are already cleaned. The next step is to build the final dataset for analysis.
- In this exercise, we will combine eight files (four election files and four temperature files). How?
    1. Append US Senate election data
    2. Append daily temperature data
    3. Merge them
- Our goal is to build a dataset with a panel structure.

### Importing Data
- Import data as usual. Recall that all files are saved in csv format.

In [4]:
elec_data = {}
temp_data = {}
for year in range(2008,2016,2):
    elec_data['elec_'+str(year)] = pd.read_csv('data/elec_senate_'+str(year)+'.csv', dtype=object)
    temp_data['elec_'+str(year)] = pd.read_csv('data/daily_temp_'+str(year)+'.csv')    

- Check the data entries and their types if you are not yet familiar with them.
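- A minimal sketch of such a check, using a small hypothetical DataFrame rather than the actual election files (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data; dtype=object keeps every column as strings,
# as in the election files read above
toy = pd.DataFrame({'state_long': ['Alabama ', 'Alaska '],
                    'gelec_rep': ['1305383', '151767']}, dtype=object)

print(toy.head())    # first rows
print(toy.dtypes)    # column types: both are object
print(toy.shape)     # (rows, columns)

# Strings must be converted before doing arithmetic
toy_rep = toy['gelec_rep'].astype(float)
```

- In the notebook, the same calls would apply to each entry of `elec_data` and `temp_data`.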

### Combining Data
#### - Appending
- Recall that we used `concatenate` to combine NumPy arrays (see python_basics_4.ipynb).
- You can do something similar for pandas objects using `concat`.
    - `concat` can be used for both appending and merging data.
    - `merge` is also available, and we will use it later. `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so prefer `concat` for appending.
- Append `data1` and `data2`.

In [None]:
data1 = pd.DataFrame(np.random.rand(3,3), columns=['var1', 'var2', 'var3'], index=['a', 'b', 'c'])
data2 = pd.DataFrame(np.random.rand(3,3), columns=['var1', 'var2', 'var3'], index=['a', 'b', 'c'])
print(data1); print(data2)
print(pd.concat([data1, data2])); print(pd.concat([data1, data2], ignore_index=True))
print(pd.concat([data1, data2], axis=1))

- What happens if some columns are missing in one of the datasets?
- To keep them, use `join='outer'`. To keep only the columns common to both datasets, use `join='inner'`.

In [None]:
data1 = pd.DataFrame(np.random.rand(2,2), columns=['var1', 'var2'], index=['a', 'b'])
data2 = pd.DataFrame(np.random.rand(3,3), columns=['var1', 'var2', 'var3'], index=['a', 'b', 'c'])
print(data1); print(data2)
print(pd.concat([data1, data2], ignore_index=True, join='inner'))
print(pd.concat([data1, data2], ignore_index=True, join='outer', sort=False))

- Let's combine all years for election data and temperature data, respectively.
- Wait... but temperature data are already very long (> 250,000 observations).
- Let's reduce the sizes of datasets before appending them.
    - Why didn't we do it when we cleaned data?

#### - Group Aggregation
- Our goal is to keep a single observation for each state and year. 
- How? There are several strategies.
    - Take the mean/max/min/std, etc.
    - Keep some of the observations
    - Reshape data
- I suggest the following procedure:
    1. Keep the Election Day temperature
    2. Take the mean of `'arithmetic_mean'` (daily average) and the max and mean of `'1st_max_value'` (daily max) for each state.
- Does it make sense to you?

- Let's keep the Election Day temperature for each year.
    - Election Day: November 4, 2008; November 2, 2010; November 6, 2012; November 4, 2014

In [5]:
# keep only the rows recorded on Election Day of each year
temp_data['elec_2008'] = temp_data['elec_2008'].loc[temp_data['elec_2008']['date_local'] == '2008-11-04',]
temp_data['elec_2010'] = temp_data['elec_2010'].loc[temp_data['elec_2010']['date_local'] == '2010-11-02',]
temp_data['elec_2012'] = temp_data['elec_2012'].loc[temp_data['elec_2012']['date_local'] == '2012-11-06',]
temp_data['elec_2014'] = temp_data['elec_2014'].loc[temp_data['elec_2014']['date_local'] == '2014-11-04',]

- Next, check whether any columns of interest have missing values before aggregating their values.
    - Recall that some computation methods do not ignore missing values.
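- A minimal sketch of this check on hypothetical toy data (the column names follow the temperature files, the values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical toy temperatures with one missing value
toy = pd.DataFrame({'state_name': ['Ohio', 'Ohio', 'Utah'],
                    'arithmetic_mean': [50.0, np.nan, 40.0]})

# Count missing values per column
n_missing = toy.isna().sum()
print(n_missing)

# pandas skips NaN by default, while np.mean on the raw array does not
pandas_mean = toy['arithmetic_mean'].mean()
numpy_mean = np.mean(toy['arithmetic_mean'].to_numpy())
print(pandas_mean, numpy_mean)
```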

- A powerful method for aggregation is `groupby`.
- The code below uses named aggregation, which requires pandas 0.25 or later. If it raises an error, update `pandas`.

In [6]:
for year in range(2008,2016,2):
    # one row per state; string function names avoid the callable deprecation warning in recent pandas
    temp_data['elec_'+str(year)+'_agg'] = temp_data['elec_'+str(year)].groupby('state_name').agg(
        temp_mean=('arithmetic_mean', 'mean'), temp_max_max=('1st_max_value', 'max'),
        temp_max_mean=('1st_max_value', 'mean')).reset_index()
    temp_data['elec_'+str(year)+'_agg']['elec_year'] = str(year)
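
- To see what named aggregation returns, here is a small sketch on hypothetical readings (two monitors in one state, one in another). The values are made up, and the functions are given by their string names:

```python
import pandas as pd

# Hypothetical toy readings: two monitors in Ohio, one in Utah
toy = pd.DataFrame({'state_name': ['Ohio', 'Ohio', 'Utah'],
                    'arithmetic_mean': [50.0, 54.0, 40.0],
                    '1st_max_value': [60.0, 66.0, 48.0]})

# Named aggregation: new_column=(source_column, function)
agg = (toy.groupby('state_name')
          .agg(temp_mean=('arithmetic_mean', 'mean'),
               temp_max_max=('1st_max_value', 'max'),
               temp_max_mean=('1st_max_value', 'mean'))
          .reset_index())
print(agg)
```

- The result has one row per state, which is the structure we need before appending the years.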

- Finally, it's time to append data.

In [7]:
elec_all = elec_data['elec_2008'] 
temp_all = temp_data['elec_2008_agg']
for year in range(2010,2016,2):
    elec_all = pd.concat([elec_all, elec_data['elec_'+str(year)]], ignore_index=True)
    temp_all = pd.concat([temp_all, temp_data['elec_'+str(year)+'_agg']], ignore_index=True)

- Check that each dataset contains state names and election years.
- Do you notice something?

In [None]:
print(elec_all['state_long'].unique()); print(elec_all['elec_year'].unique())
print(temp_all['state_name'].unique()); print(temp_all['elec_year'].unique())

#### - Removing spaces
- Some strings in `elec_all` contain trailing spaces. This is common in raw data.
- We have to remove them. Otherwise, we will not be able to merge the data properly later.
- There are several ways to remove spaces. However, in this case, it's unwise to use `.str.replace(" ","")`, which removes all spaces in a given string. Why? (By contrast, `.str.strip()` removes only leading and trailing spaces.)
- Let's remove the right-side spaces using `str.rstrip()`.
    - To remove the left-side spaces, use `str.lstrip()`.

In [8]:
elec_all['state_long'] = elec_all['state_long'].str.rstrip() 
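
- The difference between these string methods can be sketched on a hypothetical Series of state names with trailing spaces:

```python
import pandas as pd

# Hypothetical state names with trailing spaces, as they might appear in raw data
s = pd.Series(['New York  ', 'Ohio '])

right_stripped = s.str.rstrip()                   # trailing whitespace only
both_stripped = s.str.strip()                     # leading and trailing whitespace
no_spaces = s.str.replace(' ', '', regex=False)   # every space, including inner ones

print(right_stripped.tolist())
print(both_stripped.tolist())
print(no_spaces.tolist())
```

- Removing every space also changes names that legitimately contain a space, so they would no longer match the other dataset.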

- What about now?

In [None]:
print(elec_all['state_long'].unique()); print(elec_all['elec_year'].unique())
print(temp_all['state_name'].unique()); print(temp_all['elec_year'].unique())

#### - Merging
- Merging means combining data horizontally by matching rows on one or more key columns.
- To merge pandas objects, use `merge`.
    - You can also use `concat`, but `merge` is more flexible and intuitive.
- Merge `data1` and `data2`.

In [None]:
data1 = pd.DataFrame([['tom', 9], ['jerry', 12]], columns=['name', 'educ'], index=['a', 'b'])
data2 = pd.DataFrame([['tom', 185, 70], ['jerry', 170, 62], ['spike', 165, 60]], columns=['name', 'height', 'weight'], index=['a', 'b', 'c'])
print(data1); print(data2)
print(pd.merge(data1, data2, on='name')) # inner join (intersection)
print(pd.merge(data1, data2, on='name', how='outer')) # outer join (union)
print(pd.merge(data1, data2, on='name', how='right'))  # right join (keep right data)
print(pd.merge(data1, data2, on='name', how='left')) # left join (keep left data)

- The keys can have different names.

In [None]:
data1 = pd.DataFrame([['tom', 9], ['jerry', 12]], columns=['name1', 'educ'], index=['a', 'b'])
data2 = pd.DataFrame([['tom', 185, 70], ['jerry', 170, 62], ['spike', 165, 60]], columns=['name2', 'height', 'weight'], index=['a', 'b', 'c'])
print(data1); print(data2)
print(pd.merge(data1, data2, left_on='name1', right_on='name2', how='outer')) 

- You can use more than one key.

In [None]:
data1 = pd.DataFrame([['tom', 2000, 9], ['jerry', 2000, 12]], columns=['name', 'year', 'educ'], index=['a', 'b'])
data2 = pd.DataFrame([['tom', 2000, 185, 70], ['jerry', 2000, 170, 62], ['tom', 2001, 187, 75], ['jerry', 2001, 171, 63]], columns=['name', 'year', 'height', 'weight'], index=['a', 'b', 'c', 'd'])
print(data1); print(data2)
print(pd.merge(data1, data2, on=['name', 'year'], how='outer'))

- What happens if two datasets have the same column name with different values?

In [None]:
data1 = pd.DataFrame([['tom', 185], ['jerry', 170]], columns=['name', 'height'], index=['a', 'b'])
data2 = pd.DataFrame([['tom', 185, 70], ['jerry', 172, 62]], columns=['name', 'height', 'weight'], index=['a', 'b'])
print(data1); print(data2)
print(pd.merge(data1, data2, on='name'))
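
- A short sketch of how the overlapping column is handled. The suffix names `_survey` and `_clinic` are hypothetical labels chosen for illustration:

```python
import pandas as pd

data1 = pd.DataFrame([['tom', 185], ['jerry', 170]], columns=['name', 'height'])
data2 = pd.DataFrame([['tom', 185, 70], ['jerry', 172, 62]],
                     columns=['name', 'height', 'weight'])

# Default: overlapping non-key columns get the suffixes _x (left) and _y (right)
default = pd.merge(data1, data2, on='name')

# Custom suffixes make the origin of each column explicit
labeled = pd.merge(data1, data2, on='name', suffixes=('_survey', '_clinic'))

# Rows where the two sources disagree
conflicts = labeled.loc[labeled['height_survey'] != labeled['height_clinic']]
print(conflicts)
```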

- Let's merge election and temperature data.
    - What are the keys?

In [10]:
data_use = pd.merge(elec_all, temp_all, left_on=['state_long', 'elec_year'], right_on=['state_name', 'elec_year'], how='outer')
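
- One way to check how well the keys matched is the `indicator` option of `merge`. A sketch on hypothetical toy versions of the two datasets:

```python
import pandas as pd

# Hypothetical toy versions of the election and temperature data
elec = pd.DataFrame({'state_long': ['Ohio', 'Utah'], 'elec_year': ['2008', '2008']})
temp = pd.DataFrame({'state_name': ['Ohio', 'Iowa'], 'elec_year': ['2008', '2008'],
                     'temp_mean': [50.0, 45.0]})

merged = pd.merge(elec, temp, left_on=['state_long', 'elec_year'],
                  right_on=['state_name', 'elec_year'], how='outer', indicator=True)

# _merge is 'both', 'left_only', or 'right_only' for each row
counts = merged['_merge'].value_counts()
print(counts)
```

- Rows marked `left_only` or `right_only` did not find a match, which often points to differences in spelling, spacing, or key types.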

- Print `data_use`, which should be in a long format. You will often use this type of data structure for panel data analysis.
- What do you think a wide format may look like?

### Reshaping Data
- If necessary, reshape data for analysis. In that case, use `pivot`.
    - In our case, we don't need to reshape the data.

In [22]:
data_use_to_reshape = data_use.loc[data_use['state_long'].notna(),]
data1 = data_use_to_reshape.pivot(index='state_long', columns='elec_year') # reshape to a wide format
data2 = data1.stack() # back to a long format
data3 = data2.unstack() # back to a wide format
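
- To see what the wide format looks like, here is a sketch on a hypothetical two-state, two-year panel (values made up):

```python
import pandas as pd

# Hypothetical long-format panel: one row per state and election year
long_df = pd.DataFrame({'state_long': ['Ohio', 'Ohio', 'Utah', 'Utah'],
                        'elec_year': ['2008', '2010', '2008', '2010'],
                        'temp_mean': [50.0, 52.0, 40.0, 41.0]})

# Wide format: one row per state, one column per (variable, year) pair
wide = long_df.pivot(index='state_long', columns='elec_year')
print(wide)

# Setting both keys as the index and unstacking the year gives the same layout
wide2 = long_df.set_index(['state_long', 'elec_year']).unstack()
```

- Note that `pivot` requires each combination of index and column values to be unique.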

### Making Variables
- You may need more variables for analysis. For example:
    - Logarithm
    - Total, mean, min, max...
    - Share, ratio...
- Let's make:
    - Vote shares of Republican and Democratic candidates
    - Natural logarithm of temperature

In [11]:
# make vote shares
data_use['gelec_total'] = data_use['gelec_dem'].astype(float).fillna(0) + data_use['gelec_rep'].astype(float).fillna(0) + data_use['gelec_oth'].astype(float).fillna(0)
#print(data_use.loc[data_use['gelec_total'] == 0.0,])
data_use.loc[data_use['gelec_total'] == 0.0, 'gelec_total'] = np.nan # replace with NaN if the total vote is zero
data_use['rep_share'] = data_use['gelec_rep'].astype(float) / data_use['gelec_total'] # Republican vote share
data_use['dem_share'] = data_use['gelec_dem'].astype(float) / data_use['gelec_total'] # Democratic vote share
#print(data_use.loc[:, ['gelec_dem', 'gelec_rep', 'gelec_oth', 'gelec_total', 'rep_share']])

In [None]:
# take natural logs
data_use['ln_temp_mean'] = np.log(data_use['temp_mean'])
data_use['ln_temp_max_max'] = np.log(data_use['temp_max_max'])
data_use['ln_temp_max_mean'] = np.log(data_use['temp_max_mean'])
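
- The logarithm is only defined for positive values. If any temperature in the data were zero or negative (an assumption, not something checked here), one option is to mask those values first. A sketch on hypothetical temperatures:

```python
import numpy as np
import pandas as pd

# Hypothetical temperatures, including a zero and a negative value
temp = pd.Series([50.0, 0.0, -5.0])

# Mask non-positive values so the log is only taken where it is defined
ln_temp = np.log(temp.where(temp > 0))
print(ln_temp)
```

- The masked entries become NaN instead of `-inf` or a warning.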

### Saving Data

In [13]:
data_use.to_csv('data/data_use.csv', index=False)
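
- To confirm that the file was written as expected, you can read it back. The sketch below uses a hypothetical toy DataFrame and an in-memory buffer instead of a file on disk:

```python
import io
import pandas as pd

# Hypothetical toy data standing in for data_use
toy = pd.DataFrame({'state_long': ['Ohio', 'Utah'],
                    'elec_year': ['2008', '2008'],
                    'rep_share': [0.47, 0.62]})

buffer = io.StringIO()            # in-memory stand-in for a file on disk
toy.to_csv(buffer, index=False)   # index=False omits the row labels
buffer.seek(0)

# elec_year would be read back as integers unless a dtype is given
reloaded = pd.read_csv(buffer, dtype={'elec_year': str})
print(reloaded)
```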