## Comprehensive Exam - Positive Phase Duration Plot Using Kingery Bulmash Data

This notebook cleans the data digitized from the plots in the 1984 Kingery Bulmash study and saves the cleaned data for later analysis.

### Data Sources
- file1 : The data were digitized from the original plots in the Kingery Bulmash paper. Four data sets were digitized:
    - The Kingery Bulmash curve fit
    - The plot of data from references 6 & 7
    - The plot of data from reference 8
    - The plot of data from reference 9

### Changes
- 12-18-2018 : Started project

In [27]:
import pandas as pd
from pathlib import Path
from datetime import datetime

### File Locations

In [42]:
today = datetime.today()

# Shared data directory, three levels above the notebook's working directory
data_dir = Path.cwd().parents[2] / "2_data" / "kingery_bulmash_positive_phase_duration"
raw_dir = data_dir / "raw"
processed_dir = data_dir / "processed"

# Digitized input files
in_file_cv_data = raw_dir / "KB_data_positive_phase_duration_curve_data_plot_digitized.csv"
in_file_67_data = raw_dir / "KB_data_positive_phase_duration_ref_6_and7_data_plot_digitized.csv"
in_file_08_data = raw_dir / "KB_data_positive_phase_duration_ref_8_data_plot_digitized.csv"
in_file_09_data = raw_dir / "KB_data_positive_phase_duration_ref_9_data_plot_digitized.csv"

# Cleaned output files, stamped with today's date
summary_file_cv = processed_dir / f"summary_cv_{today:%b-%d-%Y}.pkl"
summary_file_67 = processed_dir / f"summary_67_{today:%b-%d-%Y}.pkl"
summary_file_08 = processed_dir / f"summary_08_{today:%b-%d-%Y}.pkl"
summary_file_09 = processed_dir / f"summary_09_{today:%b-%d-%Y}.pkl"
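
One way to keep the eight paths in sync is to generate them from a single function and check that the raw files exist before reading them. The following is only a sketch: `build_paths` and the `hypothetical_project_root` directory are illustrative names that do not appear in the notebook, while the directory and file names are copied from the cell above.

```python
from datetime import datetime
from pathlib import Path


def build_paths(project_root: Path, today: datetime) -> dict:
    """Return the raw input and dated output path for each data set label.

    `build_paths` is a hypothetical helper, not part of the notebook.
    """
    data_dir = project_root / "2_data" / "kingery_bulmash_positive_phase_duration"
    raw_names = {
        "cv": "KB_data_positive_phase_duration_curve_data_plot_digitized.csv",
        "67": "KB_data_positive_phase_duration_ref_6_and7_data_plot_digitized.csv",
        "08": "KB_data_positive_phase_duration_ref_8_data_plot_digitized.csv",
        "09": "KB_data_positive_phase_duration_ref_9_data_plot_digitized.csv",
    }
    return {
        label: {
            "raw": data_dir / "raw" / name,
            "processed": data_dir / "processed" / f"summary_{label}_{today:%b-%d-%Y}.pkl",
        }
        for label, name in raw_names.items()
    }


# Assumed project root; in the notebook this would be Path.cwd().parents[2]
paths = build_paths(Path("hypothetical_project_root"), datetime(2018, 12, 18))

# Collect any raw files that are not on disk so the failure is reported up front
missing = [entry["raw"] for entry in paths.values() if not entry["raw"].exists()]
```

If `missing` is non-empty, the notebook could stop with a clear message instead of failing later inside `pd.read_csv`.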

In [29]:
df_cv = pd.read_csv(in_file_cv_data)
df_67 = pd.read_csv(in_file_67_data)
df_08 = pd.read_csv(in_file_08_data)
df_09 = pd.read_csv(in_file_09_data)

### Column Cleanup

- Remove all leading and trailing spaces from the column names.
- Rename the columns so they are consistent across the four data sets.

In [36]:
# https://stackoverflow.com/questions/30763351/removing-space-in-dataframe-python
df_cv.columns = [x.strip() for x in df_cv.columns]
df_67.columns = [x.strip() for x in df_67.columns]
df_08.columns = [x.strip() for x in df_08.columns]
df_09.columns = [x.strip() for x in df_09.columns]
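
To see why the stripping step matters, the sketch below reads a small in-memory CSV whose header contains stray spaces, which is assumed here to resemble the output of the digitizing tool. The sample values are made up for illustration. It uses the vectorized `Index.str.strip()` accessor, which should be equivalent to the list comprehension above.

```python
import io

import pandas as pd

# Made-up sample with padded header names (assumed to mimic the digitized files)
csv_text = " x, y \n0.5, 1.2\n1.0, 1.8\n"
df_demo = pd.read_csv(io.StringIO(csv_text))

# Column names keep the padding, so df_demo["x"] would raise a KeyError here
before = list(df_demo.columns)

# Strip leading and trailing whitespace from every column name
df_demo.columns = df_demo.columns.str.strip()
after = list(df_demo.columns)
```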

In [39]:
# Map the generic digitizer column names to descriptive ones
cols_to_rename = {'x': 'scaled_distance', 'y': 'time_of_arrival'}

# Apply the same renaming to all four data sets
for df in (df_cv, df_67, df_08, df_09):
    df.rename(columns=cols_to_rename, inplace=True)

### Clean Up Data Types

In [40]:
df_cv.dtypes

scaled_distance    float64
time_of_arrival    float64
dtype: object
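
The cell above inspects only `df_cv`. A possible way to run the same check on every data set is sketched below, using small made-up frames in place of the real data. `non_float_columns` is a hypothetical helper, and the malformed value `"1,0"` is an assumed example of a digitizing artifact, not something observed in the source files.

```python
import pandas as pd
from pandas.api.types import is_float_dtype


def non_float_columns(df: pd.DataFrame) -> list:
    """Return the names of columns whose dtype is not a float type."""
    return [col for col in df.columns if not is_float_dtype(df[col])]


# Made-up stand-ins for the cleaned data sets
good = pd.DataFrame({"scaled_distance": [0.5, 1.0], "time_of_arrival": [1.2, 1.8]})
bad = pd.DataFrame({"scaled_distance": ["0.5", "1,0"], "time_of_arrival": [1.2, 1.8]})

problems_good = non_float_columns(good)
problems_bad = non_float_columns(bad)

# Coerce the offending column; values that cannot be parsed become NaN
bad["scaled_distance"] = pd.to_numeric(bad["scaled_distance"], errors="coerce")
problems_after = non_float_columns(bad)
unparsed_count = int(bad["scaled_distance"].isna().sum())
```

Counting the NaN values after coercion shows how many digitized points could not be parsed, which is worth reviewing before the data are saved.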

### Data Manipulation

### Save output file into processed directory

Save the cleaned data to the processed directory so it can be read back in later for further analysis.

Other options besides pickle include:
- feather
- msgpack (deprecated in pandas 0.25 and removed in pandas 1.0)
- parquet

In [44]:
df_cv.to_pickle(summary_file_cv)
df_67.to_pickle(summary_file_67)
df_08.to_pickle(summary_file_08)
df_09.to_pickle(summary_file_09)
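
A quick round trip can confirm that a saved file reads back unchanged. The sketch below uses a temporary directory and a made-up frame rather than the real data, and `summary_demo.pkl` is a hypothetical file name. It also creates the output directory first, since `to_pickle` does not create missing parent directories.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-in for one of the cleaned data sets
df_demo = pd.DataFrame({"scaled_distance": [0.5, 1.0], "time_of_arrival": [1.2, 1.8]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_file = Path(tmp_dir) / "processed" / "summary_demo.pkl"

    # Make sure the target directory exists before writing
    out_file.parent.mkdir(parents=True, exist_ok=True)

    df_demo.to_pickle(out_file)
    df_back = pd.read_pickle(out_file)

# True when values, dtypes, and column labels all match
round_trip_ok = df_back.equals(df_demo)
```

Pickle files should only be loaded from trusted sources, which is one reason to consider a format such as parquet when the files are shared.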