# Exploring Data Practical Solutions

There are many possible ways to complete the practical tasks, and they are not limited to the solutions provided in this document. The output of your code should, however, exactly match the output of the following solutions.

---

1.  Start a new Jupyter Notebook

2.  Import the `pandas` Python package using the standard alias: `pd`, as well as `matplotlib.pyplot` as `plt`

In [0]:
import pandas as pd
import matplotlib.pyplot as plt

3. Load the data stored in the file 'data/spending_ch3_practical.tsv' into a `DataFrame` named `spending_df` with the `unique_id` column set as the index column 

In [0]:
spending_df = pd.read_table('data/spending_ch3_practical.tsv', index_col = 'unique_id')
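As a minimal sketch of what `index_col` does, the example below reads a small in-memory TSV instead of the practical's data file. The three rows and their values are invented for illustration; only the column names are taken from this document.

```python
import io
import pandas as pd

# Hypothetical three-row TSV standing in for the practical's data file.
# Column names mirror those used in this document; the values are invented.
toy_tsv = (
    "unique_id\tdoctor_id\tspecialty\tnb_beneficiaries\tspending\n"
    "A1\t101\tCARDIOLOGY\t10\t200.0\n"
    "A2\t102\tNEUROLOGY\t20\t300.0\n"
    "A3\t103\tCARDIOLOGY\t30\t500.0\n"
)

# read_table uses a tab separator by default; read_csv needs sep='\t' spelled out.
toy_df = pd.read_table(io.StringIO(toy_tsv), index_col='unique_id')
same_df = pd.read_csv(io.StringIO(toy_tsv), sep='\t', index_col='unique_id')

print(toy_df.index.name)       # unique_id
print(toy_df.equals(same_df))  # True
```

Because `unique_id` becomes the index, it no longer appears among `toy_df.columns`.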

4. Use the `head` method with its appropriate parameter to display the first 12 lines of `spending_df`.

In [0]:
spending_df.head(n = 12)

5. What are the mean and standard deviation values of the columns `nb_beneficiaries` and `spending`?

In [0]:
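# A notebook cell displays only its last expression, so print each value explicitly
print(spending_df.loc[:, 'nb_beneficiaries'].mean())
print(spending_df.loc[:, 'nb_beneficiaries'].std())

print(spending_df.loc[:, 'spending'].mean())
print(spending_df.loc[:, 'spending'].std())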
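One possible alternative, shown here only as a sketch, is to compute both statistics for both columns in a single call with `agg`. The data below is invented; only the column names come from this document.

```python
import pandas as pd

# Invented values; only the column names are taken from this document.
toy_df = pd.DataFrame({
    'nb_beneficiaries': [10, 20, 30],
    'spending': [200.0, 300.0, 400.0],
})

# One call returns a small table: rows are the statistics, columns the variables.
summary = toy_df.loc[:, ['nb_beneficiaries', 'spending']].agg(['mean', 'std'])
print(summary)
```

Note that pandas computes the sample standard deviation by default (`ddof=1`), so the result can differ slightly from tools that default to the population formula.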

6. Sort the `spending_df` `DataFrame` on the `specialty` column in a way that permanently modifies its sorting order.

In [0]:
spending_df.sort_values(by = 'specialty', inplace = True)
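The difference between sorting in place and sorting into a new object can be sketched on a toy `DataFrame` (invented rows, hypothetical index labels):

```python
import pandas as pd

# Invented rows; the index labels are hypothetical.
toy_df = pd.DataFrame(
    {'specialty': ['NEUROLOGY', 'CARDIOLOGY', 'UROLOGY']},
    index=['A1', 'A2', 'A3'],
)

# Without inplace, a sorted copy is returned and toy_df keeps its original order.
sorted_copy = toy_df.sort_values(by='specialty')
print(list(toy_df.index))  # ['A1', 'A2', 'A3']

# With inplace=True, toy_df itself is reordered and the call returns None.
result = toy_df.sort_values(by='specialty', inplace=True)
print(result)              # None
print(list(toy_df.index))  # ['A2', 'A1', 'A3']
```

Reassigning the result, as in `toy_df = toy_df.sort_values(by='specialty')`, has the same lasting effect and is a common alternative to `inplace = True`.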

7. How many instances of `CARDIOLOGY` are in the `specialty` column of `spending_df`?

In [0]:
spending_df[spending_df.loc[:, 'specialty'] == 'CARDIOLOGY'].shape[0]
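Two other ways to arrive at the same count are sketched below on invented data: summing the boolean mask directly, and looking the value up in `value_counts`.

```python
import pandas as pd

# Invented rows; only the column name and the 'CARDIOLOGY' label come from this document.
toy_df = pd.DataFrame(
    {'specialty': ['CARDIOLOGY', 'NEUROLOGY', 'CARDIOLOGY', 'CARDIOLOGY']}
)

# True counts as 1 and False as 0, so summing the mask counts the matches.
n_from_mask = (toy_df.loc[:, 'specialty'] == 'CARDIOLOGY').sum()

# value_counts tallies every distinct value; .loc picks out one of them.
n_from_counts = toy_df.loc[:, 'specialty'].value_counts().loc['CARDIOLOGY']

print(n_from_mask, n_from_counts)  # 3 3
```

The `value_counts` lookup raises a `KeyError` if the label never occurs, whereas the mask-based count simply returns 0.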

8. Add a new column to `spending_df` named `total_spending_pct`, which is the percentage of total spending that each entry accounts for (each entry should be between 0 and 100, and the sum of all the entries in `total_spending_pct` should be 100).

In [0]:
spending_df.loc[:, 'total_spending_pct'] = (spending_df.loc[:, 'spending'] / spending_df.loc[:, 'spending'].sum()) * 100
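The two conditions stated in the task (each entry between 0 and 100, all entries summing to 100) can be checked programmatically. The sketch below uses invented spending values and compares the sum with a tolerance, since floating-point arithmetic may not give exactly 100.

```python
import math
import pandas as pd

# Invented spending values for illustration.
toy_df = pd.DataFrame({'spending': [200.0, 300.0, 500.0]})

toy_df.loc[:, 'total_spending_pct'] = (
    toy_df.loc[:, 'spending'] / toy_df.loc[:, 'spending'].sum()
) * 100

pct = toy_df.loc[:, 'total_spending_pct']

# Every entry lies in [0, 100] ...
assert pct.between(0, 100).all()
# ... and the entries sum to 100, up to floating-point rounding.
assert math.isclose(pct.sum(), 100.0)
```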

9. Make a new `DataFrame` named `top_spenders_df` that is a subset of `spending_df` containing only the columns `doctor_id`, `spending`, and `total_spending_pct`, and only those entries whose `total_spending_pct` is more than 1%.

In [0]:
top_spenders_df = spending_df[spending_df.loc[:, 'total_spending_pct'] > 1].loc[:, ['doctor_id', 'spending', 'total_spending_pct']]
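The row filter and the column selection can also be combined into a single `.loc` call. The sketch below, on invented data, checks that both forms give the same result; `keep_cols` is a name introduced here for illustration.

```python
import pandas as pd

# Invented rows; only the column names are taken from this document.
toy_df = pd.DataFrame({
    'doctor_id': [101, 102, 103],
    'specialty': ['CARDIOLOGY', 'NEUROLOGY', 'CARDIOLOGY'],
    'spending': [5.0, 495.0, 500.0],
})
toy_df['total_spending_pct'] = toy_df['spending'] / toy_df['spending'].sum() * 100

keep_cols = ['doctor_id', 'spending', 'total_spending_pct']  # hypothetical helper name
is_top = toy_df.loc[:, 'total_spending_pct'] > 1

# Filter rows first, then select columns (the form used in the solution) ...
two_step = toy_df[is_top].loc[:, keep_cols]
# ... or pass the row mask and the column list to .loc together.
one_step = toy_df.loc[is_top, keep_cols]

print(one_step.equals(two_step))  # True
```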

10. The doctor ids in the dataset we are working with are unique for each row entry. This can be confirmed by observing that the unique count of doctor ids matches the number of rows in `spending_df`. Therefore, a quick check of the results can be done by visualizing who the top spenders are; the same doctors who are saved in the `top_spenders_df` `DataFrame` should also be among the top spenders in `spending_df`.

  * $1^{st}$ Make a new `DataFrame` `doctor_spending_df` that is a subset of `spending_df` indexed by `doctor_id` and with the single column `spending`. `doctor_spending_df` should also be sorted in descending order by the values in `spending`.

  * $2^{nd}$ Plot a vertical bar plot of what you think is an appropriate number of rows from `doctor_spending_df` to verify the results saved in `top_spenders_df`.

In [0]:
doctor_spending_df = spending_df.loc[:, ['spending']]
doctor_spending_df.index = spending_df.loc[:, 'doctor_id']
doctor_spending_df.sort_values(by = 'spending', ascending = False, inplace = True)

In [0]:
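# DataFrame.plot creates its own figure, so a separate plt.figure() call would leave an empty one behind
doctor_spending_df.head(n = 10).plot(kind = 'bar')
plt.show()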
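The uniqueness claim made at the start of step 10 can be checked in code rather than by eye. The sketch below uses invented rows and also shows `set_index` as one possible alternative for building the `doctor_id`-indexed frame; `doctor_spending` is a hypothetical name for the toy result.

```python
import pandas as pd

# Invented rows; only the column names are taken from this document.
toy_df = pd.DataFrame({
    'doctor_id': [101, 102, 103, 104],
    'spending': [5.0, 495.0, 300.0, 200.0],
})

# The ids are unique when the distinct count equals the number of rows.
ids_unique = toy_df.loc[:, 'doctor_id'].nunique() == toy_df.shape[0]
# Series.is_unique answers the same question directly.
also_unique = toy_df.loc[:, 'doctor_id'].is_unique
print(ids_unique, also_unique)  # True True

# set_index moves doctor_id into the index; the sort puts the largest spenders first.
doctor_spending = (
    toy_df.set_index('doctor_id')
          .loc[:, ['spending']]
          .sort_values(by='spending', ascending=False)
)
print(list(doctor_spending.index))  # [102, 103, 104, 101]
```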

11. Save `top_spenders_df` as a CSV into a file called `data/big_spenders_practical.csv`

In [0]:
top_spenders_df.to_csv('data/big_spenders_practical.csv')
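To see what ends up in the file, the sketch below writes a toy frame to a string instead of to disk and reads it back. The rows are invented, and `toy_top` is a hypothetical stand-in for `top_spenders_df`.

```python
import io
import pandas as pd

# Hypothetical stand-in for top_spenders_df, with invented rows.
toy_top = pd.DataFrame(
    {'doctor_id': [102, 103], 'spending': [495.0, 500.0]},
    index=pd.Index(['A2', 'A3'], name='unique_id'),
)

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = toy_top.to_csv()
print(csv_text.splitlines()[0])  # unique_id,doctor_id,spending

# The index is written as the first column, so it can be restored on reading.
restored = pd.read_csv(io.StringIO(csv_text), index_col='unique_id')
print(list(restored.index))      # ['A2', 'A3']
```

When writing to a path such as `data/big_spenders_practical.csv`, the `data` directory must already exist; `to_csv` does not create missing directories.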