# Cleaning Data using Python
By Shuhei Kitamura

### Outline
Research Question: Is there any relationship between Election-Day temperature and electoral outcomes?

Let's clean two types of datasets (Senate [election results](https://transition.fec.gov/pubrec/electionresults.shtml) and [daily temperature](https://aqs.epa.gov/aqsweb/airdata/download_files.html)).

1. Importing Data
2. Treating Missing Values
3. Keeping Columns
4. Keeping Rows
5. Treating Other Values
6. Saving Data

In [1]:
import os
import pandas as pd
import numpy as np

In [2]:
pd.options.display.max_rows = 200 # set # of rows to display 
pd.options.display.max_columns = 100 # set # of columns to display 

In [3]:
os.chdir('...') # set the working directory

## 1. Importing Data
- Data are saved in different formats. Typical file extensions are: csv, tsv, and xlsx.
    - The way you import data depends on the format of a data file.
- For csv (comma-separated values) files, use `pd.read_csv()` or `pd.read_table()`.
    - Be sure to include the `sep=','` option if you use `pd.read_table()`, because its default separator is a tab.
    - If you add the `index_col=0` option, pandas uses the first column as the row index (row labels).
- For tsv (tab-separated values) files, use `pd.read_table()`, or `pd.read_csv()` with `sep='\t'`.
- For Excel files, use `pd.read_excel()`.
- Datasets
    - `elec_senate.xlsx`: US Senate general election results 2008-2014
    - `daily_TEMP_XXXX.csv`: US temperature 2008-2014
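
As a minimal sketch of the import options above, the example below reads small in-memory strings instead of files. The toy contents and column names are made up for illustration and are not taken from the actual datasets.

```python
import io
import pandas as pd

# Hypothetical toy data standing in for real files
csv_text = "state,votes\nNY,100\nCA,200\n"
tsv_text = "state\tvotes\nNY\t100\nCA\t200\n"

# Comma-separated: read_csv uses sep=',' by default
df_csv = pd.read_csv(io.StringIO(csv_text))

# read_table uses a tab by default, so sep=',' is needed for csv
df_csv2 = pd.read_table(io.StringIO(csv_text), sep=',')

# Tab-separated: read_table works as is; read_csv needs sep='\t'
df_tsv = pd.read_table(io.StringIO(tsv_text))
df_tsv2 = pd.read_csv(io.StringIO(tsv_text), sep='\t')

# index_col=0 uses the first column as the row index
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df_indexed.index.tolist())
```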

In [4]:
elec_data = {}
temp_data = {}
for year in range(2008,2016,2):
    elec_data['elec_'+str(year)] = pd.read_excel('data/elec_senate.xlsx', sheet_name=str(year), dtype=object)
    temp_data['elec_'+str(year)] = pd.read_csv('data/daily_TEMP_'+str(year)+'.csv', dtype=object)    

### Checking column names and data entries
- The very first thing to do after importing data is to check the column names (variables) of the data.
- What is the type of `elec_data` and `elec_data['elec_2008']`?
- Print the list of columns in `elec_data['elec_2008']`. Hint: Use `.columns`.
- To see some samples of the data, use `.head(#)` or `.tail(#)`, where `#` means the number of rows.
- What are the types of columns?
    - Hint: Use `.dtypes`.
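
One way to work through these checks is sketched below on a toy dictionary of DataFrames that mimics the structure of `elec_data`. The values and the `state` column are hypothetical; the real sheets may contain different columns.

```python
import pandas as pd

# Hypothetical stand-in for elec_data; values are made up
toy = {'elec_2008': pd.DataFrame(
    {'state': ['A', 'B', 'C'], 'gelec_dem': ['100', '200', None]},
    dtype=object)}

print(type(toy))                   # the container is a plain dict
print(type(toy['elec_2008']))      # each entry is a DataFrame
print(toy['elec_2008'].columns)    # column names
print(toy['elec_2008'].head(2))    # first two rows
print(toy['elec_2008'].dtypes)     # all object, because of dtype=object
```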

### Checking keys
- Next, check whether the data have a unique and non-missing key.
- Does `elec_data['elec_2008']` have such a key?
    - To get the index, use `.index`.
    - To check the uniqueness, use `.is_unique`.
    - To check the non-missingness, use `.isna().any()`.
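
The key checks can be sketched as follows. The toy DataFrame and its `state` column are assumptions for illustration; the same attributes apply to the index or to any candidate key column of the real data.

```python
import pandas as pd

# Hypothetical toy data with a duplicated and a missing state
toy = pd.DataFrame({'state': ['A', 'B', 'B', None], 'votes': [10, 20, 30, 40]})

# The default integer index is unique and has no missing values
print(toy.index.is_unique)
print(toy.index.isna().any())

# A column can serve as a key only if it is unique and non-missing
print(toy['state'].is_unique)
print(toy['state'].isna().any())
```

In this toy case the default index qualifies as a key, while `state` does not because it contains a duplicate and a missing value.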

## 2. Treating Missing Values
- Print `elec_data['elec_2008']`. Why are some values missing?
- There are several strategies to handle missing data:
    - 1. Drop them
    - 2. Replace them with a sentinel value (e.g., -999)
    - 3. Do nothing (decide later)

In [None]:
elec_data['elec_2008']

### What are missing values in Python? 
- There are two types.
    - `None`: The absence of a value.
    - `NaN` (Not a Number): A missing floating-point value.
    - (`inf`: an infinite number.)
- Print arrays in the below example.
    - `array1` and `array2` look the same, but check whether they actually are using `is` and `==`.
    - `None` appears as either `None` (for `object`, `bool`, and `str`) or `NaN` (for most other types).
- What happens if you aggregate all items in each array?
    - Try both `np.sum()` and `sum()`.

In [7]:
array1 = pd.Series([1, 2, 3, None])
array2 = pd.Series([1, 2, 3, np.nan])
array3 = pd.Series([1, 2, 3, None], dtype=object) 
array4 = pd.Series([1, 2, 3, np.inf])
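
A possible way to run the comparisons suggested above is sketched here for `array1` and `array2` only; it is one illustration, not the only way to do the exercise.

```python
import numpy as np
import pandas as pd

array1 = pd.Series([1, 2, 3, None])     # None is converted to NaN (float64)
array2 = pd.Series([1, 2, 3, np.nan])

print(array1 is array2)        # identity: these are two distinct objects
print(array1 == array2)        # element-wise; NaN == NaN evaluates to False
print(array1.equals(array2))   # treats NaNs in the same position as equal

print(np.sum(array1))          # dispatches to pandas, which skips NaN
print(sum(array1))             # the built-in sum propagates NaN
```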

### Handling Missing Values
- Useful methods
    - For checking: `.isna()`, `.notna()`
    - For deleting: `.dropna()`
    - For replacing: `.fillna()`
- Also, you may combine them with `.any()` or `.all()`.
- Check and replace missing values with zeros in `array1`.
- Drop missing values in `data1`.
    - Try `dropna(how='all',axis='columns')`, `dropna(how='all',axis='rows')`, and `dropna(how='any',axis='rows')`.

In [8]:
array1 = pd.Series([1.0, np.nan, 3.0, None])
data1 = pd.DataFrame([[1.0, np.nan, 3.0],[4.0, 5.0, None],[7.0, 8.0, 9.0],[np.nan, np.nan, np.nan]])
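
The exercises above could be worked through roughly as follows. The sketch uses `axis='index'`, the documented name for the row axis, in place of `'rows'`.

```python
import numpy as np
import pandas as pd

array1 = pd.Series([1.0, np.nan, 3.0, None])
data1 = pd.DataFrame([[1.0, np.nan, 3.0], [4.0, 5.0, None],
                      [7.0, 8.0, 9.0], [np.nan, np.nan, np.nan]])

print(array1.isna().any())     # is at least one value missing?
filled = array1.fillna(0)      # returns a new Series; array1 is unchanged

no_empty_cols = data1.dropna(how='all', axis='columns')  # drop all-missing columns
no_empty_rows = data1.dropna(how='all', axis='index')    # drop all-missing rows
complete_rows = data1.dropna(how='any', axis='index')    # keep fully observed rows
print(no_empty_cols.shape, no_empty_rows.shape, complete_rows.shape)
```

No column of `data1` is entirely missing, so the first call drops nothing; the second removes only the last row, and the third keeps only the one row without any missing value.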

- You can also subset with `.notna()` instead of calling `.dropna()`.

In [9]:
array1 = pd.Series([1.0, np.nan, 3.0, None])
array1 = array1[array1.notna()]

## 3. Keeping Columns
- Data may contain some redundant columns that will never be used in analysis. We will drop such columns to reduce the data size.
    1. A column whose values are all missing
    2. A column whose information is not important
- You should be very careful about deciding which columns to keep/drop.
    - You may not use those columns in the current project but may use them in another project.
- Let's check `elec_data['elec_2008']`.

In [None]:
elec_data['elec_2008']

- All columns look good. What about `temp_data['elec_2008']`?

In [None]:
temp_data['elec_2008']['Parameter Name'].unique()

- Let's drop `Parameter Name`, `Sample Duration`, `Pollutant Standard`, `Units of Measure`, and `AQI`.
- There are several ways to remove columns that are all missing. For example:

In [5]:
for year in range(2008,2016,2):
    temp_data['elec_'+str(year)+'_keep'] = temp_data['elec_'+str(year)].dropna(how='all',axis='columns')
#temp_data['elec_2008_keep']
#temp_data.keys()

- Next, drop the remaining columns that we do not need.

In [6]:
for year in range(2008,2016,2):
    temp_data['elec_'+str(year)+'_keep'] = temp_data['elec_'+str(year)+'_keep'].drop(['Parameter Name', 'Sample Duration', 'Units of Measure'],axis=1)

### Changing Column Names
- Once we have all columns we need, it is time to modify column names. 
- If column names are very long, use uppercase letters, contain spaces, or are written in any other non-generic format, we need to change them.
- To simplify the process, we just replace spaces with underscores and change uppercase to lowercase.
    - Some column names may still be too long.

In [None]:
temp_data['elec_2008_keep'].columns

In [None]:
for year in range(2008,2016,2):
    temp_data['elec_'+str(year)+'_keep'].columns = temp_data['elec_'+str(year)+'_keep'].columns.str.replace(' ', '_')
    temp_data['elec_'+str(year)+'_keep'].columns = [x.lower() for x in temp_data['elec_'+str(year)+'_keep'].columns]
temp_data['elec_2008_keep'].columns
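
The same renaming can be written by chaining the string methods on the column index, as in the sketch below. The toy column names and the shorter name `temp_mean` are hypothetical and only illustrate how a name that is still too long could be shortened afterwards.

```python
import pandas as pd

# Hypothetical column names with spaces and uppercase letters
toy = pd.DataFrame(columns=['State Name', 'Date Local', 'Arithmetic Mean'])

# Replace spaces with underscores, then convert to lowercase
toy.columns = toy.columns.str.replace(' ', '_').str.lower()
print(list(toy.columns))

# Shorten a single name with an explicit mapping (hypothetical new name)
toy = toy.rename(columns={'arithmetic_mean': 'temp_mean'})
print(list(toy.columns))
```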

## 4. Keeping Rows
- Since columns are all good by now, we move on to rows.
- We need to decide how many rows to keep in the final data.
- Things to consider:
    - Should we keep rows that are all missing?
    - If there are multiple entries per unit of observation, should we keep all of them?

- First, let's look at `elec_data['elec_2008']` again. Recall that all election results are missing for some rows.

In [None]:
elec_data['elec_2008']

- Let's drop such rows.

In [None]:
for year in range(2008,2016,2):
    elec_data['elec_'+str(year)+'_keep'] = elec_data['elec_'+str(year)].dropna(how='all',subset=['gelec_dem', 'gelec_rep', 'gelec_oth'])
elec_data['elec_2008_keep']
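
To see what `how='all'` combined with `subset` does, here is a small sketch on made-up data. Only the three `gelec_*` column names come from the code above; the `state` column and all values are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'state': ['A', 'B', 'C'],
    'gelec_dem': [10, np.nan, np.nan],
    'gelec_rep': [20, 5, np.nan],
    'gelec_oth': [np.nan, np.nan, np.nan],
})

# A row is dropped only if ALL three result columns are missing
kept = toy.dropna(how='all', subset=['gelec_dem', 'gelec_rep', 'gelec_oth'])
print(kept['state'].tolist())
```

Row `B` survives because one of its three results is present; only row `C`, where all three are missing, is removed.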

- Next, print `temp_data['elec_2008_keep']['date_local']` for New York using `'state_name'`.
    - What does the column `'date_local'` mean?
    - Should we drop all dates that are not needed for the analysis? Election Days are:
        - November 4th, 2008, November 2nd, 2010, November 6th, 2012, November 4th, 2014

In [None]:
temp_data['elec_2008_keep'].loc[temp_data['elec_2008_keep']['state_name'] == 'New York', 'date_local']

- I suggest we keep all dates. 
    - The reason: We may use daily temperature on non-election days as well.
- Next, print `temp_data['elec_2008_keep']` for New York and `2008-11-04`, i.e., Election Day.
    - Are there multiple entries? Why so?

In [None]:
temp_data['elec_2008_keep'].loc[(temp_data['elec_2008_keep']['state_name'] == 'New York') & (temp_data['elec_2008_keep']['date_local'] == '2008-11-04'),]

- What should we do? There are several options. What are the pros and cons of each method?
    - 1. Aggregate data
    - 2. Reshape data
    - 3. Delete some observations
    - 4. Do nothing (decide later)

- There is no right answer!
- Let's take the fourth option for now. Two reasons: 
    - There is no point in losing rich information in the data (regarding 1 and 3).
    - There will be many missing values if we reshape the data (regarding 2).
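
For reference, the first option (aggregation) could look roughly like the sketch below if we decide to use it later. The toy readings are made up, and the column name `arithmetic_mean` is an assumption about how the daily temperature is stored.

```python
import pandas as pd

# Hypothetical readings: two monitors in the same state on the same day
toy = pd.DataFrame({
    'state_name': ['New York', 'New York', 'Ohio'],
    'date_local': ['2008-11-04', '2008-11-04', '2008-11-04'],
    'arithmetic_mean': ['50.0', '54.0', '48.0'],  # strings, as with dtype=object
})

# Convert to numbers before averaging
toy['arithmetic_mean'] = pd.to_numeric(toy['arithmetic_mean'])

# One row per state and day: the mean across monitors
state_day = (toy.groupby(['state_name', 'date_local'], as_index=False)
                ['arithmetic_mean'].mean())
print(state_day)
```

Averaging collapses the monitor-level variation, which is exactly the information we prefer to keep for now.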

## 5. Treating Other Values
- It could often be the case that you need to modify entries in a dataset "by hand".
    - Example: some numeric values are stored as strings.
- The bottom line: whenever possible, make every change to the original data in code and keep that code, so that the cleaning can be reproduced.
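
A minimal sketch of such a fix in code, assuming numbers stored as strings with thousands separators and a non-numeric placeholder (the values are hypothetical):

```python
import pandas as pd

# Hypothetical raw entries: numbers stored as strings
raw = pd.Series(['1,234', '567', 'n/a'])

# Remove thousands separators, then convert; unparseable entries become NaN
clean = pd.to_numeric(raw.str.replace(',', '', regex=False), errors='coerce')
print(clean)
```

Because the conversion is written down as code, it can be rerun on the original data at any time instead of being edited by hand.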

## 6. Saving Data
- You can save data in many formats. However, csv is usually preferred. Why?
    - Since a csv file is just a text file, it can be opened in any text editor and is easy to share.
    - By contrast, if you do not want to lose information about formatting, macros, etc., it may be better to keep the data in Excel format.
- To save data as a csv file, use `mydata.to_csv()`.
    - Use `mydata.to_excel()` to save in Excel format.

In [9]:
for year in range(2008,2016,2):
    elec_data['elec_'+str(year)+'_keep'].to_csv('data/elec_senate_'+str(year)+'.csv', index=False)
    temp_data['elec_'+str(year)+'_keep'].to_csv('data/daily_temp_'+str(year)+'.csv', index=False)
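
As a quick check that saving works as intended, the sketch below writes a toy DataFrame to a temporary csv file and reads it back. The file name and data are hypothetical.

```python
import os
import tempfile
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'state': ['A', 'B'], 'votes': [10, 20]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.csv')
    toy.to_csv(path, index=False)   # index=False omits the row index
    back = pd.read_csv(path)        # read the file back in

print(back)
```

Without `index=False`, the row index would be written as an extra unnamed column and would reappear as a column when the file is read again.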