# Cleaning Data using Python
By Shuhei Kitamura

**Research Question**: Is there any relationship between Election-Day temperature and electoral outcomes?

- To examine the question, we need to:
    - Obtain existing data or create our own
    - Clean the data
- In this course, we use two datasets
    - [Senate election results](https://transition.fec.gov/pubrec/electionresults.shtml) in the United States
    - [Daily temperature](https://aqs.epa.gov/aqsweb/airdata/download_files.html)
- We start by cleaning the datasets.

### Outline<a id='top'></a>
1. [Importing Data](#sec1)
2. [Treating Missing Values](#sec2)
3. [Selecting Columns](#sec3)
4. [Selecting Rows](#sec4)
5. [Treating Other Values](#sec5)
6. [Saving Data](#sec6)

In [2]:
# import packages and modules
import os
import pandas as pd
import numpy as np

In [3]:
# set the display options (not necessary)
pd.options.display.max_rows = 200 # set the max number of rows to display 
pd.options.display.max_columns = 100 # set the max number of columns to display 

In [4]:
# set the working directory (if necessary)
os.chdir(r'D:\Dropbox\Git repo\dm2020\data') # replace the path with the location of your working directory (the r prefix keeps the backslashes literal)

## 1. Importing Data<a id='sec1'></a>
- Data are saved in different formats. Typical file extensions are: tsv, csv, and xlsx (xls).
- The way you import data depends on the format of the data file.
- For tsv (tab separated data) files, use `pd.read_table()` or `pd.read_csv()`.
    - Make sure to include the `delimiter='\t'` option if you use `pd.read_csv()`.
- For csv (comma separated data) files, use `pd.read_table()` or `pd.read_csv()`.
    - Make sure to include `sep=','` option if you use `pd.read_table()`.
- For excel files, use `pd.read_excel()`.
- Datasets
    - `elec_senate.xlsx`: US Senate general election results 2008-2014
    - `daily_TEMP_XXXX.csv`: US temperature 2008-2014
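
As a minimal sketch of the reader functions listed above, the example below parses two small in-memory texts. The texts and the column names `state`, `year`, and `votes` are made up for illustration and stand in for real tsv and csv files.

```python
import io
import pandas as pd

# Hypothetical in-memory "files" (not the course data)
tsv_text = "state\tyear\tvotes\nNY\t2008\t100\nCA\t2008\t200\n"
csv_text = "state,year,votes\nNY,2008,100\nCA,2008,200\n"

# Tab-separated: read_table() splits on tabs by default; read_csv() needs the delimiter
tsv_a = pd.read_table(io.StringIO(tsv_text))
tsv_b = pd.read_csv(io.StringIO(tsv_text), delimiter='\t')

# Comma-separated: read_csv() splits on commas by default; read_table() needs sep=','
csv_a = pd.read_csv(io.StringIO(csv_text))
csv_b = pd.read_table(io.StringIO(csv_text), sep=',')

# All four calls should produce the same table
print(tsv_a.equals(tsv_b), csv_a.equals(csv_b), tsv_a.equals(csv_a))
```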

[back to top](#top)

In [5]:
elec_data = {} # make an empty dictionary
temp_data = {}
for year in range(2008,2016,2):
    elec_data[str(year)] = pd.read_excel('elec_senate.xlsx', sheet_name=str(year)) # add each dataset to the dictionary
    temp_data[str(year)] = pd.read_csv('daily_TEMP_'+str(year)+'.csv')

### Checking data entries
- The very first thing to do after importing data is to check the content of the data.
- To see data entries, use `object.head(n)` or `object.tail(n)`, where `n` is the number of rows to show.
    - To get the column names, use `object.columns`.
    - To get the indices, use `object.index`. To print every index label, convert the outcome into a list using `list()`.

In [None]:
elec_data['2008']
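
The inspection tools above can be tried on a small table first. This is only a sketch; `toy` and its columns are hypothetical and are not part of the course data.

```python
import pandas as pd

# Hypothetical four-row table
toy = pd.DataFrame({'state': ['NY', 'CA', 'TX', 'WA'],
                    'votes': [100, 200, 300, 400]})

first_two = toy.head(2)   # first two rows
last_one = toy.tail(1)    # last row
print(first_two)
print(last_one)

print(toy.columns)        # column labels
print(toy.index)          # a compact RangeIndex object
print(list(toy.index))    # [0, 1, 2, 3]
```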

- Next, check that column names and indices are unique and non-missing.
    - To check the uniqueness, use `object.is_unique`.
    - To check the non-missingness, use `object.isna().any()`.
        - `isna()` returns `True` for each item that is missing. `any()` then returns `True` if at least one of those results is `True`.
- Check that columns and indices are unique and non-missing in `elec_data['2008']`.

In [None]:
elec_data['2008']
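
A short sketch of these checks on a deliberately flawed table may help. The table below is hypothetical: it has a repeated column name and a missing index label so that each check has something to find.

```python
import numpy as np
import pandas as pd

# Hypothetical table: column 'a' appears twice and one index label is missing
toy = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                   columns=['a', 'b', 'a'],
                   index=['x', np.nan])

cols_unique = toy.columns.is_unique      # False: 'a' is repeated
idx_unique = toy.index.is_unique         # True: 'x' and NaN differ
cols_missing = toy.columns.isna().any()  # False: no column name is missing
idx_missing = toy.index.isna().any()     # True: one index label is NaN

print(cols_unique, idx_unique, cols_missing, idx_missing)
```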

- To check the type of each column, use `object.dtypes`.
    - If you want to specify the type while importing data, write like `dtype={'column1':np.float64, 'column2':str}`, etc.
- Run the following code. Then, add `dtype={'elec_year':str}` in `pd.read_excel()` above and run the same code. What did you find?

In [None]:
print(type(elec_data['2008'].elec_year)) # type of a column (variable)
print(type(elec_data['2008'].elec_year[0])) # type of an item in the same column
print(type(elec_data['2008'].gelec_dem[0]))
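
The effect of the `dtype` option can also be sketched without the Excel file. The example below uses `pd.read_csv()` on a made-up two-row text as a stand-in, on the assumption that the option behaves the same way in `pd.read_excel()`; the values are invented.

```python
import io
import pandas as pd

# Hypothetical rows that reuse two column names from the election data
csv_text = "elec_year,gelec_dem\n2008,0.55\n2008,0.41\n"

default = pd.read_csv(io.StringIO(csv_text))                            # types are inferred
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'elec_year': str})  # elec_year forced to text

print(default.dtypes)
print(as_text.dtypes)
print(type(default.elec_year[0]), type(as_text.elec_year[0]))
```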

- To get the summary statistics of the data, use `object.describe()`.

In [None]:
elec_data['2008'].describe()

## 2. Treating Missing Values<a id='sec2'></a>
- Print `elec_data['2008']`. Why are some values missing?
- There are several strategies to handle missing data:
    - 1. Drop them
    - 2. Replace them with a sentinel value (e.g., -999)
    - 3. Do nothing (decide later)

[back to top](#top)

In [None]:
elec_data['2008']

### What are missing values in Python? 
- There are two (three) types.
    - `None`: The absence of a value.
    - `NaN` (Not a Number): A missing floating-point number.
    - (`inf`: an infinite number.)
- Print series in the below example.
    - Observe that `None` appears as either `None` or `NaN`.
- `series1` and `series3` look the same. Check whether they really are the same using `is` and `==`.
- What happens if you aggregate all items in each series?
    - Try every series. Also, try both `np.sum()` and `sum()`. What did you get?

In [None]:
series1 = pd.Series([1, 2, 3, None])
series2 = pd.Series([1, 2, 3, None], dtype=object) 
series3 = pd.Series([1, 2, 3, np.nan])
series4 = pd.Series([1, 2, 3, np.inf])
print(series1); print(series2); print(series3); print(series4)
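
One way to approach these comparisons is sketched below with two separate series, `s_none` and `s_nan` (names chosen here for illustration). It covers only the float case; the `object` and `inf` series are left for you to try.

```python
import numpy as np
import pandas as pd

s_none = pd.Series([1, 2, 3, None])    # None is stored as NaN, so the dtype is float64
s_nan = pd.Series([1, 2, 3, np.nan])
print(s_none.dtype, s_nan.dtype)

print(s_none is s_nan)           # False: they are two distinct objects
print(s_none.equals(s_nan))      # True: same values, with NaN in the same position
print((s_none == s_nan).all())   # False: NaN == NaN is False element by element

total_pandas = s_nan.sum()       # the pandas method skips NaN by default
total_numpy = np.sum(s_nan)      # on a Series, np.sum() defers to the pandas method
total_builtin = sum(s_nan)       # the built-in sum() adds NaN in, so the result is NaN
print(total_pandas, total_numpy, total_builtin)
```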

### Handling missing values
- Useful methods:
    - For checking: `object.isna()`, `object.notna()`
        - You can combine them with `.any()` or `.all()`, e.g. `object.isna().any()`.
    - For deleting: `object.dropna()`. It returns a new object, so assign the result to a name to keep the changes.
    - For replacing: `object.fillna()`. It also returns a new object, so assign the result to a name to keep the changes.
- Check and replace all missing values with zeros in `series1`.
- Drop missing values in `df1`.
    - Try `object.dropna(how='all', axis='columns')`, `object.dropna(how='all', axis='rows')`, and `object.dropna(how='any', axis='rows')`. What did you find?

In [None]:
series1 = pd.Series([1.0, np.nan, 3.0, None])
df1 = pd.DataFrame([[1.0, np.nan, 3.0],[4.0, 5.0, None],[7.0, 8.0, 9.0],[np.nan, np.nan, np.nan]])
print(series1); print(df1)
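
The methods above can be sketched on a different, hypothetical table so that the exercise with `df1` stays open. The names `s`, `df`, and the columns `a`, `b`, `c` are made up.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 30.0])
df = pd.DataFrame({'a': [1.0, np.nan, np.nan],
                   'b': [2.0, 5.0, np.nan],
                   'c': [np.nan, np.nan, np.nan]})

print(s.isna().any())       # True: at least one value is missing
s_filled = s.fillna(0)      # a new Series; s itself is unchanged

no_empty_cols = df.dropna(how='all', axis='columns')  # drops 'c', which is entirely missing
no_empty_rows = df.dropna(how='all', axis='rows')     # drops the last row, which is entirely missing
complete_rows = df.dropna(how='any', axis='rows')     # every row has a gap, so nothing is left

print(no_empty_cols); print(no_empty_rows); print(complete_rows)
```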

- Alternatively, you can use `object.notna()` instead of `object.dropna()` for subsetting.
    - However, this technique does not always work.
- Run the following code. Try the same thing for `df1`. What did you find?

In [None]:
series1 = pd.Series([1.0, np.nan, 3.0, None])
df1 = pd.DataFrame([[1.0, np.nan, 3.0],[4.0, 5.0, None],[7.0, 8.0, 9.0],[np.nan, np.nan, np.nan]])
print(series1)
series1 = series1[series1.notna()]
print(series1)
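
A sketch of why the mask behaves differently for a table, using a small hypothetical DataFrame: indexing with a two-dimensional mask keeps the shape and leaves the gaps in place, so the mask has to be reduced to one value per row first.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing cell
df = pd.DataFrame({'a': [1.0, np.nan, 7.0],
                   'b': [2.0, 5.0, 8.0]})

masked = df[df.notna()]                  # same shape as df; the missing cell is still NaN
complete = df[df.notna().all(axis=1)]    # one True/False per row, so rows are actually removed

print(masked)
print(complete)
```

Here `complete` should match `df.dropna()`, which is usually the simpler choice for tables.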

## 3. Selecting Columns<a id='sec3'></a>
- Data may contain some redundant columns that will never be used in any analysis. We will drop such columns to reduce the data size.
- Candidates:
    1. A column whose values are all missing
    2. A column whose information is not important
- You should be very careful about selecting columns.
    - You may not use those columns in the current project but may use them in another project.
- Let's check `elec_data['2008']`.

[back to top](#top)

In [None]:
elec_data['2008']

- All columns look good. What about `temp_data['2008']`?

In [None]:
temp_data['2008']

- Let's drop `'Parameter Name'`, `'Sample Duration'`, `'Pollutant Standard'`, `'Units of Measure'`, and `'AQI'`.
- First, remove columns that are all missing.

In [None]:
for year in range(2008,2016,2):
    temp_data[str(year)+'_keep'] = temp_data[str(year)].dropna(how='all', axis='columns')

- Next, drop the remaining columns.

In [None]:
for year in range(2008,2016,2):
    temp_data[str(year)+'_keep'] = temp_data[str(year)+'_keep'].drop(['Parameter Name', 'Sample Duration', 'Units of Measure'], axis=1)
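
The two steps can be seen in miniature on a hypothetical table. The column `'Arithmetic Mean'` and all values below are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row table with one entirely missing column
toy = pd.DataFrame({'State Name': ['New York', 'Ohio'],
                    'Units of Measure': ['Degrees Fahrenheit', 'Degrees Fahrenheit'],
                    'AQI': [np.nan, np.nan],
                    'Arithmetic Mean': [48.2, 51.0]})

step1 = toy.dropna(how='all', axis='columns')      # removes 'AQI' because every value is missing
step2 = step1.drop(['Units of Measure'], axis=1)   # removes a column by name

print(list(step1.columns))
print(list(step2.columns))
```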

### Changing column names
- If column names contain uppercase letters or spaces, are very long, or are written in any other non-standard format, we should change them.
- Check the column names in `temp_data['2008_keep']`.

In [None]:
temp_data['2008_keep'].columns

- I suggest that we (a) replace spaces with underscores and (b) change uppercase letters to lowercase.
    - To replace spaces with underscores, you can use `str.replace(' ', '_')`.
    - To change to lowercase, you can use `object.lower()`. If you want to use uppercase, use `object.upper()`. If you want to capitalize column names, use `object.capitalize()`.
- Run the following code and check the column names in `temp_data['2008_keep']`.

In [None]:
for year in range(2008,2016,2):
    temp_data[str(year)+'_keep'].columns = temp_data[str(year)+'_keep'].columns.str.replace(' ', '_') # .str gives access to string methods for every column name. replace(a, b) changes a to b
    temp_data[str(year)+'_keep'].columns = [x.lower() for x in temp_data[str(year)+'_keep'].columns] # x.lower() changes x to lowercase
print(temp_data['2008_keep'].columns)
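
As an alternative sketch, both changes can be chained through `.str` in a single line. The header below is hypothetical.

```python
import pandas as pd

# Hypothetical empty table that only carries column names
toy = pd.DataFrame(columns=['State Name', 'Date Local', 'Arithmetic Mean'])

# Replace spaces with underscores, then convert to lowercase
toy.columns = toy.columns.str.replace(' ', '_').str.lower()
print(list(toy.columns))
```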

## 4. Selecting Rows<a id='sec4'></a>
- Next, let's move on to rows.
- We need to decide which rows to keep in the final data.
- Candidates:
    - Rows that have missing values for important variables such as id
    - Multiple rows per unit of observation

[back to top](#top)

- First, let's look at `elec_data['2008']` again. Recall that all election results are missing for some rows.

In [None]:
elec_data['2008']

- Let's drop such rows.
- Run the following code.
    - The `subset` option restricts the check to the listed columns: with `how='all'`, a row is dropped only if all of those columns are missing.

In [None]:
for year in range(2008,2016,2):
    elec_data[str(year)+'_keep'] = elec_data[str(year)].dropna(how='all', subset=['gelec_dem', 'gelec_rep', 'gelec_oth'])
elec_data['2008_keep']
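
A toy version of the same call, with made-up vote shares, may make the role of `subset` easier to see. Only the column names are borrowed from the election data.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: AK has no results at all, AZ has one result
toy = pd.DataFrame({'state': ['AL', 'AK', 'AZ'],
                    'gelec_dem': [0.36, np.nan, np.nan],
                    'gelec_rep': [0.63, np.nan, 0.55],
                    'gelec_oth': [np.nan, np.nan, np.nan]})

# Drop a row only when all three result columns are missing; 'state' is ignored
kept = toy.dropna(how='all', subset=['gelec_dem', 'gelec_rep', 'gelec_oth'])
print(kept)
```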

- Next, print the `'date_local'` column for `"New York"` using `'state_name'` in `temp_data['2008_keep']`.
    - What does this column mean?
    - Should we drop all the dates that are not needed for the analysis? Election Days are:
        - November 4th, 2008, November 2nd, 2010, November 6th, 2012, November 4th, 2014

In [None]:
temp_data['2008_keep'].loc[temp_data['2008_keep'].state_name == "New York", 'date_local'] 

- I suggest we keep all dates. 
    - The reason: We may use daily temperature on non-election days as well.
- Next, print `temp_data['2008_keep']` for `"New York"` and `"2008-11-04"`, i.e., Election Day.
    - Why are there multiple entries?

In [None]:
temp_data['2008_keep'].loc[(temp_data['2008_keep'].state_name == "New York") & (temp_data['2008_keep'].date_local == "2008-11-04"), ]

- What should we do? There are several options. What are the pros and cons of each method?
    - 1. Aggregate data
    - 2. Reshape data
    - 3. Delete some observations
    - 4. Do nothing (decide later)

- There is no right answer!
- Let's take the fourth option for now. Two reasons: 
    - There is no point in losing rich information in the data (regarding 1 and 3). 
    - There will be many missing values if we reshape them (regarding 2).
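
To make the trade-offs concrete, the first three options are sketched below on a hypothetical table. The columns `site_num` and `mean_temp` and all values are assumptions, not the actual columns of the temperature data.

```python
import pandas as pd

# Hypothetical readings: two monitoring sites in New York, one in Ohio
toy = pd.DataFrame({'state_name': ['New York', 'New York', 'Ohio'],
                    'date_local': ['2008-11-04'] * 3,
                    'site_num': [1, 2, 3],
                    'mean_temp': [50.0, 52.0, 60.0]})
keys = ['state_name', 'date_local']

# Option 1: aggregate to one row per state and date (site detail is lost)
agg = toy.groupby(keys, as_index=False)['mean_temp'].mean()

# Option 2: reshape to one column per site (sites a state lacks become NaN)
wide = toy.pivot(index=keys, columns='site_num', values='mean_temp')

# Option 3: keep only the first reading per state and date (the rest is discarded)
first = toy.drop_duplicates(subset=keys)

print(agg); print(wide); print(first)
```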

## 5. Treating Other Values<a id='sec5'></a>
- You will often need to modify entries in a dataset "by hand".
    - Example: Some data entries are incorrect.
- The bottom line: Whenever possible, make such changes in code and keep that code, so that everything you did to the original data is documented and can be repeated.
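
A minimal sketch of a documented correction, assuming a made-up typo in a made-up table: the fix lives in code, and the original values are left untouched in a separate object.

```python
import pandas as pd

# Hypothetical data: 5.1 stands for a typo of 0.51
raw = pd.DataFrame({'state': ['NY', 'CA'],
                    'gelec_dem': [0.55, 5.1]})

fixed = raw.copy()   # work on a copy so the original stays as it was

# Correction recorded in code: CA's share was entered as 5.1 instead of 0.51
fixed.loc[fixed['state'] == 'CA', 'gelec_dem'] = 0.51

print(raw)
print(fixed)
```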

[back to top](#top)

## 6. Saving Data<a id='sec6'></a>
- You can save data in any format. However, csv is usually the preferred one. Why?
    - Since a csv file is just a text file, it can be opened in any text editor and is easy to share.
    - By contrast, if you do not want to lose information about formatting, macros, etc., it may be better to save it in excel format.
- To save data as a csv file, use `mydata.to_csv()`.
    - Use `mydata.to_excel()` for saving in excel format.
    
[back to top](#top)

In [None]:
for year in range(2008,2016,2):
    elec_data[str(year)+'_keep'].to_csv('us_senate_'+str(year)+'.csv', index=False)
    temp_data[str(year)+'_keep'].to_csv('us_daily_temp_'+str(year)+'.csv', index=False)
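
To see what `to_csv()` writes, the sketch below keeps the output in memory instead of on disk. The table is hypothetical. When no file name is given, `to_csv()` returns the csv text as a string.

```python
import io
import pandas as pd

# Hypothetical table
toy = pd.DataFrame({'state': ['NY', 'CA'], 'votes': [100, 200]})

csv_text = toy.to_csv(index=False)   # index=False leaves out the row labels
print(csv_text)

# Reading the text back should recover the same columns and values
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```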