### Outline
2. Analyzing Data
    - Summary Tables
    - Figures
    - Regression Tables

In [1]:
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.iolib.summary2 import summary_col
import statsmodels.formula.api as smf
plt.style.use('seaborn-v0_8-white') # set a plot style (named 'seaborn-white' before matplotlib 3.6)
np.random.seed(123456789) # set a seed

In [2]:
pd.options.display.max_rows = 200 # set # of rows to display 
pd.options.display.max_columns = 100 # set # of columns to display 

In [3]:
os.chdir('...') # set the working directory

In [4]:
data = pd.read_csv('data/data_use.csv') # import data
data = data.sort_values(by=['state_short', 'elec_year']) # sort data
#print(data.dtypes)

### Summary Tables
- Almost every empirical paper includes a table of summary statistics.
- Summary tables matter because readers rely on them to understand the distribution of each variable.
- To make a summary table, use `agg`.
- You can save the output in several formats. However, I suggest LaTeX.
    - The reason: you will import the file later when you write the paper.
- For the moment, we produce files with the `txt` file extension.
    - If you already have TeX installed on your computer, change the file extension from `txt` to `tex`.

In [8]:
data1 = data.loc[:, ['gelec_dem', 'gelec_rep', 'gelec_oth', 'gelec_total', 'rep_share', 'dem_share', 'elec_year', 'temp_mean', 'temp_max_max', 'temp_max_mean']]
data1.agg(['mean', 'std', 'min', 'max', 'count']).T.to_latex("sum_stat.txt", float_format="%.2f")
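To see what `agg` followed by `.T` produces before it is written to a file, here is a minimal sketch on a made-up data frame. The values are invented and the column names only mimic those in `data_use.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names only mimic those in data_use.csv
toy = pd.DataFrame({
    'rep_share': [0.40, 0.55, np.nan, 0.61],
    'dem_share': [0.58, 0.43, 0.50, 0.37],
    'temp_mean': [12.0, 18.5, 9.0, 21.0],
})

# One row per variable, one column per statistic
summary = toy.agg(['mean', 'std', 'min', 'max', 'count']).T
summary['count'] = summary['count'].astype(int)  # counts are whole numbers
print(summary.round(2))
```

Note that `count` ignores missing values, so a variable with a lower count than the others has missing observations.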

### Figures
- Good papers include figures that summarize the main results.
    - Do not write a paper with tables only.
- A very useful package for plotting figures is `matplotlib`.
- There are several plot types: histograms, density plots, scatter plots, bar plots, line plots...
    - Each type serves a different purpose.
- Regarding the file format, I suggest either `png` or `jpg`.

#### - Histograms
- Histograms are often used to show the distributions of your data.
- Use `hist` to plot a histogram. There are several options:
    - `bins`: number of bins
    - `align`: alignment of the bars relative to the bin edges
    - `range`: lower and upper limits of the bins
    - `alpha`: transparency
    - `color`: color
    - `label`: label

In [None]:
plt.hist(data['rep_share'][data['rep_share'].notna()], bins=20, align='mid', range=(0,1), alpha=0.3, color='r', label='rep')
plt.hist(data['dem_share'][data['dem_share'].notna()], bins=20, align='mid', range=(0,1), alpha=0.3, color='b', label='dem')
plt.title('Vote share, 2008-2014') # add title
plt.legend(loc='best') # add legend
plt.savefig('hist_rep_share.png', dpi=100) # save file
plt.show() # plot
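The `bins` and `range` options jointly fix the bin edges. As a rough check, the sketch below reproduces the same binning with `np.histogram`, which `plt.hist` uses internally. The vote shares are made up for illustration.

```python
import numpy as np

# Hypothetical vote shares, for illustration only
shares = np.array([0.02, 0.31, 0.33, 0.48, 0.52, 0.52, 0.67, 0.99])

# Same binning as plt.hist(..., bins=20, range=(0, 1))
counts, edges = np.histogram(shares, bins=20, range=(0, 1))

print(edges[:4])     # first few bin edges; each bin is 0.05 wide
print(counts.sum())  # every observation inside the range is counted once
```

With 20 bins on the interval from 0 to 1, there are 21 edges and each bar covers 5 percentage points of vote share.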

#### - Density Plots
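- Density plots are a smoothed variant of histograms.
    - Instead of counting observations in bins, you estimate the distribution with a smoothing method such as a kernel density estimator.
- A density plot does not depend on how you choose bins, although it does depend on the amount of smoothing (the bandwidth).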

In [None]:
rdm = pd.Series(np.random.randn(1000)) # generate random values
plt.hist(rdm, bins=100, alpha=0.3, color='b', density=True) # plot a histogram
# rdm.hist(bins=100, alpha=0.3, color='b', density=True) # alternative
rdm.plot.kde(style='k--') # plot a kernel density
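A density plot avoids the bin choice, but it still has a tuning parameter of its own, the bandwidth. The sketch below computes a Gaussian kernel density by hand with `numpy` to make this visible. The function name `kde_by_hand`, the seed, and the two bandwidths are arbitrary choices for illustration, and this is not how `plot.kde` is implemented internally (it relies on `scipy`, and its `bw_method` argument controls the bandwidth).

```python
import numpy as np

rng = np.random.default_rng(0)   # local generator; the seed is arbitrary
sample = rng.standard_normal(500)
grid = np.linspace(-5, 5, 401)

def kde_by_hand(sample, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth`, one per observation."""
    z = (grid[:, None] - sample[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernel.mean(axis=1) / bandwidth

narrow = kde_by_hand(sample, grid, bandwidth=0.1)  # wiggly estimate
wide = kde_by_hand(sample, grid, bandwidth=1.0)    # smoother, flatter estimate

step = grid[1] - grid[0]
print(narrow.sum() * step, wide.sum() * step)  # both areas are close to 1
```

A small bandwidth follows the sample closely and produces a bumpy curve, while a large bandwidth flattens the peak.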

#### - Bar Plots
- Bar plots are often used to compare statistics (e.g., mean) for different groups.

In [None]:
dic1 = {'tom':1.75, 'jerry':1.82, 'spike':1.65, 'tyke':1.4}
dic2 = {'tom':1, 'jerry':1, 'spike':0, 'tyke':0}
data1 = pd.DataFrame({'height': dic1, 'treatment': dic2})
print(data1)
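means = data1.groupby('treatment')['height'].mean() # average height by treatment status
plt.bar(means.index, means.values, linewidth=0) # one bar per group
plt.xticks([0,1]); plt.xlabel("treatment"); plt.ylabel("average height")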

#### - Scatter Plots
- Scatter plots are often used to show the relationship between two variables.

In [None]:
plt.scatter(y=data['rep_share'], x=data['temp_max_mean'])
plt.ylabel("Republican vote share"); plt.xlabel("Mean temperature")

#### - Line Plots
- Line plots are often used to show time trends.

In [None]:
plt.plot(np.random.randn(1000).cumsum(), color='b', label='one') 
plt.plot(np.random.randn(1000).cumsum(), color='r', label='two')
plt.plot(np.random.randn(1000).cumsum(), color='y', label='three')
plt.legend(loc='best')

- Another useful package for plotting figures is `seaborn`.
- A [cheat sheet](https://www.datacamp.com/community/blog/seaborn-cheat-sheet-python) is also available.

In [None]:
titanic = sns.load_dataset('titanic') # import titanic data
#print(titanic)
sns.barplot(x='sex', y='survived', data=titanic) # bar plot of the mean of 'survived' by sex (try adding the option hue='class')

In [None]:
sns.histplot(data['rep_share'].dropna(), bins=20, kde=True, stat='density') # histogram with a density plot (set "kde=False" if you do not want a density plot); replaces the deprecated sns.distplot, requires seaborn 0.11 or later

In [None]:
g = sns.lmplot(x='temp_max_mean', y='rep_share', data=data) # scatter plot with a regression line
g.set_axis_labels("Mean temperature", "Republican vote share").set(xlim=(5, 90), xticks=[10,20,30,40,50,60,70,80])

### Regression Tables
- Empirical papers report their estimates in regression tables.
- The `statsmodels` package is useful for making regression tables.
    - To run a regression, use `smf` (the formula interface) or `sm`.
    - To produce a table of the regression results, use `summary_col`.
- How can you interpret the regression results?

In [None]:
data1 = data.dropna(subset=['rep_share', 'ln_temp_max_max']) # remove rows with missing values in the regression variables
reg1 = smf.ols('rep_share ~ ln_temp_max_max', data=data1).fit() # OLS
reg2 = smf.ols('rep_share ~ ln_temp_max_max + C(state_short) + C(elec_year)', data=data1).fit() # FE
reg3 = smf.ols('rep_share ~ ln_temp_max_max + C(state_short) + C(elec_year)', data=data1).fit(cov_type='cluster', cov_kwds={'groups':data1['state_short']}) # FE + clustering

results_table = summary_col(results=[reg1, reg2, reg3],
                            float_format='%0.3f',
                            stars=True,
                            model_names=['OLS', 'FE', 'FE+cluster'],
                            info_dict={'R-squared' : lambda x: "{:.3f}".format(x.rsquared),
                                       'Observations' : lambda x: "{0:d}".format(int(x.nobs))},
                            regressor_order=['ln_temp_max_max'],
                            drop_omitted=True) # summarize results

results_table.add_title("Table 1: Correlation between Election Day temperature and Senate Republican vote share") # add title

print(results_table)

with open('estimates.txt', 'w') as file: # open a file (closed automatically at the end of the block)
    file.write(results_table.as_latex()) # save the table
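On the question of interpretation: the regressor above is named `ln_temp_max_max`, which suggests a natural log of temperature, though the data description would need to confirm this. Under that assumption, the model is level-log, and a 1% increase in temperature changes the outcome by roughly the coefficient divided by 100. The sketch below checks this on simulated data with plain `numpy`. All numbers, including the slope of 0.05, are made up and are not estimates from the election data.

```python
import numpy as np

rng = np.random.default_rng(1)            # arbitrary seed
temp = rng.uniform(40, 90, size=2000)     # made-up temperatures
ln_temp = np.log(temp)
true_beta = 0.05                          # made-up slope
y = 0.3 + true_beta * ln_temp + rng.normal(0, 0.01, size=2000)

# OLS of y on a constant and ln_temp
X = np.column_stack([np.ones_like(ln_temp), ln_temp])
alpha_hat, beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# Effect of a 1% rise in temperature on y: beta * ln(1.01), roughly beta / 100
effect_1pct = beta_hat * np.log(1.01)
print(round(beta_hat, 3), effect_1pct)
```

If the outcome is a vote share between 0 and 1, a coefficient of 0.05 would mean that a 1% higher temperature is associated with about 0.05 percentage points more vote share. Whether this is a causal effect or only a correlation depends on the research design, not on the regression output.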