#### This is the Jupyter Notebook for my Tidy Data Project, which uses a data file on federal research and development budgets across different government departments.

To tidy the data, I must first define my dataframe.

In [3]:
import pandas as pd # import the pandas library, referred to as pd from here on

df_fedrd = pd.read_csv('fed_rd_year&gdp.csv') # define dataframe

Next, I need to melt my dataframe so that the variables are properly defined. This step reorients the table so that the year and GDP labels, which are values rather than variable names, now correctly appear in rows instead of column headers. The resulting column must still be split so that year and GDP are separate columns.

In [4]:
df_fedrd_melted = pd.melt(df_fedrd,
                          id_vars = 'department', # holds constant the column for department
                          value_vars = df_fedrd.columns[1:], # selects all the columns to be reoriented
                          var_name = 'Year_GDP', # renames the column now containing the previous column names
                          value_name='Research and Development Budget') #names the column containing all of the values from the original untidy dataframe

df_fedrd_melted

Unnamed: 0,department,Year_GDP,Research and Development Budget
0,DHS,1976_gdp1790000000000.0,
1,DOC,1976_gdp1790000000000.0,8.190000e+08
2,DOD,1976_gdp1790000000000.0,3.569600e+10
3,DOE,1976_gdp1790000000000.0,1.088200e+10
4,DOT,1976_gdp1790000000000.0,1.142000e+09
...,...,...,...
583,NIH,2017_gdp19177000000000.0,3.305200e+10
584,NSF,2017_gdp19177000000000.0,6.040000e+09
585,Other,2017_gdp19177000000000.0,1.553000e+09
586,USDA,2017_gdp19177000000000.0,2.625000e+09
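As a minimal sketch of what the melt does, here is the same call on a toy table. The department names and numbers are invented for illustration and are not from the real file; only the column naming pattern (`year_gdpvalue`) mirrors the dataset.

```python
import pandas as pd

# Toy wide table: one row per department, one column per year/GDP label.
# All names and numbers here are invented for illustration.
wide = pd.DataFrame({
    "department": ["A", "B"],
    "1976_gdp100.0": [1.0, 2.0],
    "1977_gdp110.0": [3.0, 4.0],
})

# Melt: keep 'department' fixed and stack every other column into rows.
long = pd.melt(
    wide,
    id_vars="department",
    var_name="Year_GDP",  # holds the old column names
    value_name="Research and Development Budget",  # holds the old cell values
)

print(long)
```

Each department now appears once per original column, so two departments and two year columns give four rows.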


Now, I still need to split the combined column so that GDP and year are separate columns with correct formatting.

In [24]:
df_fedrd_melted[['Year','GDP']] = df_fedrd_melted['Year_GDP'].str.split('_', expand=True) # splits the column holding two variables at the underscore; expand=True returns two columns instead of a single column of lists
df_fedrd_melted_tidy = df_fedrd_melted.drop('Year_GDP', axis=1) # gets rid of messy/unwanted column
df_fedrd_melted_tidy['GDP'] = df_fedrd_melted_tidy['GDP'].str.replace('gdp','') # fixes formatting such that GDP column does not contain unnecessary information
df_fedrd_melted_tidy

Unnamed: 0,department,Research and Development Budget,Year,GDP
0,DHS,,1976,1790000000000.0
1,DOC,8.190000e+08,1976,1790000000000.0
2,DOD,3.569600e+10,1976,1790000000000.0
3,DOE,1.088200e+10,1976,1790000000000.0
4,DOT,1.142000e+09,1976,1790000000000.0
...,...,...,...,...
583,NIH,3.305200e+10,2017,19177000000000.0
584,NSF,6.040000e+09,2017,19177000000000.0
585,Other,1.553000e+09,2017,19177000000000.0
586,USDA,2.625000e+09,2017,19177000000000.0
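An alternative sketch, not the approach used above: a single regular expression with named groups can split the label and drop the `gdp` prefix in one step. The pattern assumes every label looks like `1976_gdp1790000000000.0`; the variable names `labels` and `parts` are only for illustration.

```python
import pandas as pd

# Two labels in the same shape as the Year_GDP column.
labels = pd.Series(
    ["1976_gdp1790000000000.0", "2017_gdp19177000000000.0"],
    name="Year_GDP",
)

# Named groups become the column names of the resulting dataframe.
parts = labels.str.extract(r"^(?P<Year>\d{4})_gdp(?P<GDP>[\d.]+)$")

# The extracted pieces are still strings, so convert them to numbers.
parts["Year"] = pd.to_numeric(parts["Year"])
parts["GDP"] = pd.to_numeric(parts["GDP"])

print(parts)
```

Labels that do not match the pattern would come back as missing values, which makes malformed column names easy to spot.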


The next step is to handle the missing data. The dataset has no research and development budget recorded for the DHS (the Department of Homeland Security) from 1976 to 2001. This makes sense because the Department of Homeland Security was created in 2002. Thus, these values are missing at random: the missingness does not depend on the value of the budget itself but on a different variable, the year. Because it is known that the department did not exist in those years, we can impute a value of zero for the budget, as no budget was allocated to it.

In [41]:
df_fedrd_melted_tidy.loc[df_fedrd_melted_tidy['Research and Development Budget'].isnull()] = df_fedrd_melted_tidy.loc[df_fedrd_melted_tidy['Research and Development Budget'].isnull()].fillna(0) # TODO: confirm that zero is the correct imputation
# .loc selects rows with a boolean mask and columns by name rather than by integer position. Here the .isnull() mask picks out the rows where the research and development budget is missing
# based on the context of the dataset and the history of the DHS, I impute a value of zero for those rows with .fillna()
df_fedrd_melted_tidy

Unnamed: 0,department,Research and Development Budget,Year,GDP
0,DHS,0,1976,1790000000000.0
1,DOC,819000000.0,1976,1790000000000.0
2,DOD,35696000000.0,1976,1790000000000.0
3,DOE,10882000000.0,1976,1790000000000.0
4,DOT,1142000000.0,1976,1790000000000.0
...,...,...,...,...
583,NIH,33052000000.0,2017,19177000000000.0
584,NSF,6040000000.0,2017,19177000000000.0
585,Other,1553000000.0,2017,19177000000000.0
586,USDA,2625000000.0,2017,19177000000000.0
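A shorter way to express the same imputation is to fill only the budget column, after first checking which departments actually have gaps. This is a sketch on a toy table with invented numbers, not the real budgets.

```python
import numpy as np
import pandas as pd

# Toy long table with one missing budget; all values are invented.
toy = pd.DataFrame({
    "department": ["DHS", "DOC", "DHS", "DOC"],
    "Year": [2001, 2001, 2002, 2002],
    "Research and Development Budget": [np.nan, 5.0, 7.0, 6.0],
})
budget = "Research and Development Budget"

# Count missing budgets per department before changing anything.
missing_by_dept = toy[budget].isna().groupby(toy["department"]).sum()
print(missing_by_dept)

# Fill only the budget column; the other columns are left untouched.
toy[budget] = toy[budget].fillna(0)
print(toy)
```

Filling a single float column this way keeps it as a float column, and the per-department count confirms that the gaps belong to one department before a zero is imputed.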


Now that I have a tidy dataset with each variable as a column, each observation in its own row, and an imputation for the missing data, it is important to create separate tables for the observational units. In this case, the observational units are the departments, so I will create a table for each department to show the budget and GDP trends for each individual unit.

In [44]:
df_fedrd_melted_tidy # TODO: make a separate table for each department

Unnamed: 0,department,Research and Development Budget,Year,GDP
0,DHS,0,1976,1790000000000.0
1,DOC,819000000.0,1976,1790000000000.0
2,DOD,35696000000.0,1976,1790000000000.0
3,DOE,10882000000.0,1976,1790000000000.0
4,DOT,1142000000.0,1976,1790000000000.0
...,...,...,...,...
583,NIH,33052000000.0,2017,19177000000000.0
584,NSF,6040000000.0,2017,19177000000000.0
585,Other,1553000000.0,2017,19177000000000.0
586,USDA,2625000000.0,2017,19177000000000.0
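One possible way to build a table per department is to group the tidy dataframe by `department` and collect each group in a dictionary. The sketch below uses a toy table with invented numbers; `department_tables` is a hypothetical name, not something defined elsewhere in this notebook.

```python
import pandas as pd

# Toy tidy table in the same shape as df_fedrd_melted_tidy; values are invented.
toy = pd.DataFrame({
    "department": ["DOC", "DOD", "DOC", "DOD"],
    "Research and Development Budget": [1.0, 2.0, 3.0, 4.0],
    "Year": [1976, 1976, 1977, 1977],
    "GDP": [100.0, 100.0, 110.0, 110.0],
})

# One dataframe per department, keyed by department name.
# The department column is dropped because it is constant within each table.
department_tables = {
    name: group.drop(columns="department").reset_index(drop=True)
    for name, group in toy.groupby("department")
}

print(department_tables["DOC"])
```

Looking up `department_tables["DOC"]` then gives the budget and GDP by year for that one department.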


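Before building a pivot table for the budget, I convert the columns to numeric types. Reference for changing a column's type in pandas:
https://stackoverflow.com/questions/15891038/change-column-type-in-pandas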

In [50]:
df_fedrd_melted_tidy['Research and Development Budget'] = pd.to_numeric(df_fedrd_melted_tidy['Research and Development Budget'])
df_fedrd_melted_tidy['GDP'] = pd.to_numeric(df_fedrd_melted_tidy['GDP'])
df_fedrd_melted_tidy['Year'] = pd.to_numeric(df_fedrd_melted_tidy['Year'])

In [51]:
df_fedrd_melted_tidy

Unnamed: 0,department,Research and Development Budget,Year,GDP
0,DHS,0.000000e+00,1976,1.790000e+12
1,DOC,8.190000e+08,1976,1.790000e+12
2,DOD,3.569600e+10,1976,1.790000e+12
3,DOE,1.088200e+10,1976,1.790000e+12
4,DOT,1.142000e+09,1976,1.790000e+12
...,...,...,...,...
583,NIH,3.305200e+10,2017,1.917700e+13
584,NSF,6.040000e+09,2017,1.917700e+13
585,Other,1.553000e+09,2017,1.917700e+13
586,USDA,2.625000e+09,2017,1.917700e+13
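To illustrate what `pd.to_numeric` does, here is a small sketch on invented string data. The `errors="coerce"` option is not used in this notebook; it is shown only as one way to handle a cell that cannot be parsed.

```python
import pandas as pd

# Toy columns stored as strings; the 'not available' entry is invented.
toy = pd.DataFrame({
    "Year": ["1976", "1977"],
    "GDP": ["1790000000000.0", "not available"],
})

# Clean numeric strings convert directly.
toy["Year"] = pd.to_numeric(toy["Year"])

# errors="coerce" turns unparseable entries into NaN instead of raising an error.
toy["GDP"] = pd.to_numeric(toy["GDP"], errors="coerce")

print(toy.dtypes)
print(toy)
```

Checking `dtypes` after the conversion is a quick way to confirm that each column is numeric before aggregating it.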


In [55]:
pivot_table_gdp_rdbudget = pd.pivot_table(df_fedrd_melted_tidy, values = ['Research and Development Budget', 'GDP'], index = 'department', aggfunc = 'mean')
pivot_table_gdp_rdbudget

department,GDP,Research and Development Budget
DHS,9175119000000.0,379000000.0
DOC,9175119000000.0,1231500000.0
DOD,9175119000000.0,64685190000.0
DOE,9175119000000.0,11883380000.0
DOT,9175119000000.0,917785700.0
EPA,9175119000000.0,750428600.0
HHS,9175119000000.0,22296760000.0
Interior,9175119000000.0,900571400.0
NASA,9175119000000.0,12140260000.0
NIH,9175119000000.0,21117570000.0
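The GDP column is identical for every department because GDP varies only by year, so averaging it within each department always averages the same set of yearly values. The sketch below reproduces this on a toy table with invented numbers and shows, as one possible alternative layout, a pivot with years as rows and departments as columns.

```python
import pandas as pd

# Toy tidy table; values are invented for illustration.
toy = pd.DataFrame({
    "department": ["DOC", "DOD", "DOC", "DOD"],
    "Research and Development Budget": [1.0, 2.0, 3.0, 4.0],
    "Year": [1976, 1976, 1977, 1977],
    "GDP": [100.0, 100.0, 110.0, 110.0],
})

# Same call shape as above: mean budget and mean GDP per department.
by_department = pd.pivot_table(
    toy,
    values=["Research and Development Budget", "GDP"],
    index="department",
    aggfunc="mean",
)
print(by_department)

# Alternative layout: one row per year, one column per department.
by_year = pd.pivot_table(
    toy,
    values="Research and Development Budget",
    index="Year",
    columns="department",
    aggfunc="mean",
)
print(by_year)
```

In the per-year layout each cell holds a single budget value, so the trend for a department can be read down its column.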
