# Analyzing Data using Python
By Shuhei Kitamura

### Outline<a id='top'></a>
1. [Summary Tables](#sec1)
2. [Figures](#sec2)
3. [Regression Tables](#sec3)

In [None]:
# import packages and modules
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.iolib.summary2 import summary_col
import statsmodels.formula.api as smf

In [None]:
# set some options (not necessary)
pd.options.display.max_rows = 200 # set the max number of rows to display 
pd.options.display.max_columns = 100 # set the max number of columns to display 
plt.style.use('seaborn-v0_8-white') # set a plot style (this style was named 'seaborn-white' before matplotlib 3.6)

In [None]:
# set the working directory (if necessary)
# os.chdir('...') # replace '...' with the location of the working directory

In [None]:
data = pd.read_csv('data_use.csv') # import data
print(len(data))
data = data.loc[data['gelec_total']!=0, ] # drop rows where gelec_total is zero (note: this does not drop NaN rows)
print(len(data))
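
Filtering on `!= 0` and dropping missing values are different operations. A minimal sketch on made-up numbers (the toy frame below is hypothetical and only borrows the column name `gelec_total`):

```python
import numpy as np
import pandas as pd

# hypothetical toy data: one zero and one missing value
toy = pd.DataFrame({'gelec_total': [100, 0, np.nan, 250]})

nonzero = toy.loc[toy['gelec_total'] != 0]        # removes the zero, keeps the NaN row
nonmissing = toy.dropna(subset=['gelec_total'])   # removes the NaN row, keeps the zero

print(len(nonzero), len(nonmissing))
```

Because `NaN != 0` evaluates to `True`, the first filter keeps missing rows. Chain both steps if you want to drop zeros and missing values.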

## 1. Summary Tables<a id='sec1'></a>
- A summary table reports descriptive statistics of your data, such as the mean, the standard deviation, and the number of observations.
- Summary tables are useful because they let readers see the scale and coverage of your data at a glance.
- To make a summary table, use `object.agg()`.
- You can produce an output table in several formats. I suggest the LaTeX format.
    - The reason: you can compile the table directly when you write a paper.
- For the moment, we produce files with the `txt` file extension.
    - If you already have TeX on your computer, change the file extension from `txt` to `tex`.
            
[back to top](#top)

In [None]:
cols = ['gelec_dem', 'gelec_rep', 'gelec_oth', 'gelec_total', 'rep_share', 'dem_share',
        'elec_year', 'temp_mean', 'temp_max_max', 'temp_max_mean']
data1 = data.loc[:, cols] # select columns
data1.agg(['mean', 'std', 'min', 'max', 'count']).T.to_latex("sum_stat.txt", float_format="%.2f") # one row per variable, saved as LaTeX
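
To see what `agg(...)` followed by `.T` produces before it is written to a file, here is a small sketch on made-up data (the columns `height` and `weight` are hypothetical and are not in the election data):

```python
import pandas as pd

# hypothetical toy data with one missing value
toy = pd.DataFrame({'height': [1.5, 1.6, 1.7, 1.8],
                    'weight': [50.0, 60.0, 70.0, None]})

# one column per statistic after transposing, one row per variable
summary = toy.agg(['mean', 'std', 'min', 'max', 'count']).T
print(summary.round(2))
```

Note that `count` reports the number of non-missing observations, so it can differ across variables.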

## 2. Figures<a id='sec2'></a>
- Good papers always have figures that summarize results well.
    - You should not write a paper only with boring tables!
- A very useful package for plotting figures is `matplotlib`.
- There are several types of figures: histograms, density plots, scatter plots, bar plots, line plots...
    - Each type serves a different purpose.
- Regarding the format, I suggest either `png` or `jpg`.
        
[back to top](#top)

### Histograms
- Histograms are often used to show the distribution of your data.
- Use `plt.hist` to plot a histogram. There are several options:
    - `bins`: number of bins (or a list of bin edges)
    - `align`: alignment of the bars relative to the bin edges
    - `range`: lower and upper limits of the bins
    - `alpha`: transparency
    - `color`: color
    - `label`: label shown in the legend

In [None]:
bins = np.linspace(0, 1, 11) # bin edges: 0, 0.1, ..., 1.0
plt.hist(data['rep_share'].dropna(), bins=bins, align='mid', alpha=0.3, color='r', label='rep') # make a histogram
plt.hist(data['dem_share'].dropna(), bins=bins, align='mid', alpha=0.3, color='b', label='dem')
plt.title('Vote share, 2008-2014') # add a title
plt.legend(loc='best') # add a legend
plt.savefig('hist_rep_share.png', dpi=100) # save the file
plt.show() # plot
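
To check what `bins` and `range` do without drawing anything, you can call `np.histogram`, which `plt.hist` relies on to bin the data. A small sketch with made-up values (not the election data):

```python
import numpy as np

# hypothetical vote shares
values = np.array([0.05, 0.15, 0.15, 0.45, 0.55, 0.95])

# 5 equal-width bins on [0, 1] give edges 0, 0.2, 0.4, 0.6, 0.8, 1.0
counts, edges = np.histogram(values, bins=5, range=(0, 1))
print(counts) # number of observations in each bin
print(edges)  # bin edges (one more than the number of bins)
```

Changing `bins` changes the counts and therefore the shape of the histogram, so it is worth trying a few values.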

### Density Plots
- Density plots are a smoothed variation of histograms.
    - For density plots, you use a smoothing method (here, kernel density estimation) to estimate the distribution.
    - The result is not affected by how you choose bins, although it does depend on the smoothing bandwidth.
- Use `object.plot.kde()` to plot a kernel density.

In [None]:
r = np.random.RandomState(123456789) # set a seed
rdm = pd.Series(r.randn(1000)) # generate random values
plt.hist(rdm, bins=100, alpha=0.3, color='b', density=True) # make a histogram
#rdm.hist(bins=100, alpha=0.3, color='b', density=True) # an alternative way of making a histogram
rdm.plot.kde(style='k--') # plot a kernel density
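
The idea behind a kernel density can be sketched by hand: place a small Gaussian bump on every observation and average the bumps. The function below is a hand-rolled illustration, not the implementation pandas uses, and the bandwidth of 0.4 is an arbitrary assumption:

```python
import numpy as np

rng = np.random.RandomState(0) # set a seed
sample = rng.randn(200)        # made-up data
grid = np.linspace(-4, 4, 401) # points at which the density is evaluated

def gaussian_kde_manual(sample, grid, bandwidth):
    # average of Gaussian bumps centered at each observation
    z = (grid[:, None] - sample[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

density = gaussian_kde_manual(sample, grid, bandwidth=0.4)
area = density.sum() * (grid[1] - grid[0]) # approximate integral of the density
print(area)
```

The area under the curve is close to 1, as it should be for a density. A larger bandwidth gives a smoother curve, and a smaller one follows the data more closely.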

### Bar Plots
- Bar plots are often used to compare statistics (e.g., mean) for different groups.
- Use `plt.bar` to plot a bar plot.

In [None]:
dict1 = {'tom':1.75, 'jerry':1.82, 'spike':1.65, 'tyke':1.4}
dict2 = {'tom':1, 'jerry':1, 'spike':0, 'tyke':0}
df1 = pd.DataFrame({'height': dict1, 'treatment': dict2})
print(df1)
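means = df1.groupby('treatment')['height'].mean() # average height by treatment group
plt.bar(means.index, means.values, linewidth=0) # one bar per group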
plt.xticks([0,1]) # set xticks
plt.xlabel("treatment") # add a xlabel
plt.ylabel("average height") # add a ylabel
plt.show()

### Scatter Plots
- Scatter plots are often used to show the relationship between two variables.
- Use `plt.scatter` to plot a scatter plot.

In [None]:
plt.scatter(y=data['rep_share'], x=data['temp_max_mean'])
plt.ylabel("Republican vote share")
plt.xlabel("Mean temperature")
plt.show()

### Line Plots
- Line plots are often used to show time trends.
- Use `plt.plot` to plot a line plot.

In [None]:
r = np.random.RandomState(123456789) # set a seed
plt.plot(r.randn(1000).cumsum(), color='b', label='one') 
plt.plot(r.randn(1000).cumsum(), color='r', label='two')
plt.plot(r.randn(1000).cumsum(), color='y', label='three')
plt.legend(loc='best')
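
With real data, you usually need to aggregate by period before plotting a time trend. A minimal sketch, assuming a made-up frame that only borrows the column names `elec_year` and `rep_share`:

```python
import pandas as pd

# hypothetical toy data, not the election data used above
toy = pd.DataFrame({'elec_year': [2008, 2008, 2010, 2010, 2012],
                    'rep_share': [0.4, 0.6, 0.5, 0.7, 0.55]})

# average vote share by election year
trend = toy.groupby('elec_year')['rep_share'].mean()
print(trend)
```

You can then pass `trend.index` and `trend.values` to `plt.plot` to draw the line.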

- Another useful package for making figures is `seaborn`.
- A [cheat sheet](https://www.datacamp.com/community/blog/seaborn-cheat-sheet-python) is also available.

In [None]:
titanic = sns.load_dataset('titanic') # import titanic data
#print(titanic)
sns.barplot(x='sex', y='survived', data=titanic) # make a bar plot (you can also add the hue='class' option)

In [None]:
sns.histplot(titanic['age'], bins=50, kde=True) # histogram with a density curve (set kde=False to omit it); sns.distplot is deprecated

In [None]:
g = sns.lmplot(x='temp_max_mean', y='rep_share', data=data) # scatter plot with a regression line
g.set_axis_labels("Mean temperature", "Republican vote share")
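
The line that `sns.lmplot` draws is a fitted linear regression of `y` on `x`. As a rough sketch of the underlying fit, using made-up numbers rather than the election data, `np.polyfit` with degree 1 returns the slope and intercept of the least-squares line:

```python
import numpy as np

# hypothetical data that roughly follow y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

slope, intercept = np.polyfit(x, y, deg=1) # least-squares line
print(round(slope, 2), round(intercept, 2))
```

The slope tells you how much `y` changes on average when `x` increases by one unit.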

## 3. Regression Tables<a id='sec3'></a>
- The `statsmodels` package is useful for making regression tables.
    - To run a regression, use `smf` (the formula interface, as below) or `sm` (the array interface).
    - To produce a table of the regression results, use `summary_col`.
- How can you interpret the regression results?
        
[back to top](#top)

In [None]:
data1 = data.dropna(subset=['rep_share', 'ln_temp_max_max', 'state_short', 'elec_year']) # remove rows with missing values in any regression variable
reg1 = smf.ols('rep_share ~ ln_temp_max_max', data=data1).fit() # OLS
reg2 = smf.ols('rep_share ~ ln_temp_max_max + C(state_short) + C(elec_year)', data=data1).fit() # FE
reg3 = smf.ols('rep_share ~ ln_temp_max_max + C(state_short) + C(elec_year)', data=data1).fit(cov_type='cluster', cov_kwds={'groups':data1['state_short']}) # FE + clustering

results_table = summary_col(results=[reg1, reg2, reg3],
                            float_format='%0.3f', # set how many decimals you want to report
                            stars=True, # add stars
                            model_names=['OLS', 'FE', 'FE+cluster'], # add model names
                            info_dict={'R-squared' : lambda x: "{:.3f}".format(x.rsquared), # add R-squared and observations
                                       'Observations' : lambda x: "{0:d}".format(int(x.nobs))},
                            regressor_order=['ln_temp_max_max'],
                            drop_omitted=True) 

results_table.add_title("Table 1: Correlation between Election Day temperature and Senate Republican vote share") # add a title

print(results_table)

with open('estimates.txt', 'w') as file: # open a file (it is closed automatically at the end of the block)
    file.write(results_table.as_latex()) # save the table
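
To build intuition for reading the table, it helps to run the same kind of regression on data where the true coefficient is known. The sketch below uses simulated data (the variables `x`, `y`, and `group` are hypothetical) with a true slope of 0.5, and compares default and clustered standard errors:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.RandomState(0) # set a seed
n = 200
toy = pd.DataFrame({'x': rng.randn(n),
                    'group': np.repeat(np.arange(20), 10)}) # 20 groups of 10 observations
toy['y'] = 1.0 + 0.5 * toy['x'] + 0.1 * rng.randn(n) # true intercept 1.0, true slope 0.5

plain = smf.ols('y ~ x', data=toy).fit() # OLS with default standard errors
clustered = smf.ols('y ~ x', data=toy).fit(cov_type='cluster',
                                           cov_kwds={'groups': toy['group']}) # clustered standard errors

print(plain.params)                        # estimated intercept and slope
print(plain.bse['x'], clustered.bse['x'])  # standard errors of the slope
```

Clustering changes the standard errors (and therefore the stars) but not the coefficients, which is why the `FE` and `FE+cluster` columns report the same point estimates.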