# Data Cleaning and Preparation Practical Solutions

Please note that there are many possible ways to complete the practical tasks; you are not limited to the solutions provided in this document. The output of your code should, however, exactly match the output of the following solutions.

---

1.  Start a new Jupyter Notebook

2.  Import the `pandas` Python package using the standard alias: `pd`, as well as `matplotlib.pyplot` as `plt`

In [0]:
import pandas as pd
import matplotlib.pyplot as plt

3. Read the file `data/spending_ch4_practical_1.csv` into a new `pandas` DataFrame named `spending_df`, with the index column set to `unique_id`

In [0]:
spending_df = pd.read_csv('data/spending_ch4_practical_1.csv', index_col='unique_id')

4. Inspect the data type of each column of `spending_df`. Do you notice anything that should be corrected?

In [0]:
spending_df.dtypes

* Change the data type of the `doctor_id` column to 'object' and the `spending` column to 'float64'

In [0]:
spending_df.doctor_id = spending_df.doctor_id.astype('object')
spending_df.spending = (spending_df.spending
                        .str.replace("$", "", regex=False)
                        .str.replace(",", "", regex=False)
                        .astype("float64"))
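
The conversion above can be tried on a small made-up Series before running it on the real column. This is a minimal sketch: the values in `raw` are hypothetical and are not taken from the data file. Passing `regex=False` makes it explicit that `$` is meant as a literal character rather than as a regular-expression anchor.

```python
import pandas as pd

# Hypothetical currency strings, shaped like the raw `spending` column
raw = pd.Series(["$1,250.50", "$300.00", "$42"])

cleaned = (raw
           .str.replace("$", "", regex=False)   # strip the literal dollar sign
           .str.replace(",", "", regex=False)   # strip thousands separators
           .astype("float64"))                  # numeric strings -> floats

print(cleaned.dtype)
print(cleaned.tolist())
```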

5. Drop rows that have fewer than `4` non-missing values, in place

In [0]:
spending_df.dropna(thresh=4, axis='rows', inplace=True)
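
To see how `thresh` behaves at the boundary, here is a small sketch on a made-up DataFrame (the `toy` frame and its column names are hypothetical). A row with exactly 4 non-missing values is kept; only rows with fewer than 4 are dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with 5 columns and a varying number of missing values per row
toy = pd.DataFrame({
    "a": [1, np.nan, 3],
    "b": [np.nan, np.nan, 6],
    "c": ["x", np.nan, "z"],
    "d": [1.0, 2.0, 3.0],
    "e": [10, 20, 30],
})

# Row 0 has 4 non-missing values, row 1 has 2, row 2 has 5
kept = toy.dropna(thresh=4, axis="rows")

print(kept.index.tolist())
```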

6. Replace the missing values in the columns `nb_beneficiaries` and `spending` with their respective medians, and the missing values in `specialty` with the most frequent specialty. The replacement should be in place, i.e. the original `DataFrame` should be updated

In [0]:
# mode() returns a Series (there may be ties), so take its first entry
spending_df.fillna({'nb_beneficiaries': spending_df.nb_beneficiaries.median(),
                   'spending': spending_df.spending.median(),
                   'specialty': spending_df.specialty.mode()[0]},
                  inplace=True)
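
The per-column fill can be illustrated on a tiny made-up frame (the `toy` data below is hypothetical). The sketch also shows why a single value is selected from the mode: `Series.mode()` returns a Series, not a scalar.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column
toy = pd.DataFrame({
    "nb_beneficiaries": [10.0, np.nan, 30.0],
    "specialty": ["CARDIOLOGY", "CARDIOLOGY", None],
})

most_frequent = toy.specialty.mode()   # a Series, even when there is a single mode

# Each dictionary key names a column; each value is the fill value for that column
toy = toy.fillna({"nb_beneficiaries": toy.nb_beneficiaries.median(),
                  "specialty": most_frequent[0]})

print(toy)
```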

7. How many rows have the value "NURSE PRACTITIONER" in the column `specialty` and a value of `spending` lower than $5000?

In [0]:
spending_df[(spending_df.specialty == 'NURSE PRACTITIONER') & 
            (spending_df.spending < 5000)].shape[0]

  * Remove those rows from the original `DataFrame`

In [0]:
spending_df = spending_df[~((spending_df.specialty == 'NURSE PRACTITIONER') & (spending_df.spending < 5000))]
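
Counting and removing use the same boolean mask, so it can help to build the mask once and reuse it. The following is a minimal sketch on made-up rows; `toy`, `mask`, `n_matching` and `remaining` are hypothetical names.

```python
import pandas as pd

# Hypothetical rows: only the first one matches both conditions
toy = pd.DataFrame({
    "specialty": ["NURSE PRACTITIONER", "NURSE PRACTITIONER", "CARDIOLOGY"],
    "spending": [1200.0, 9000.0, 300.0],
})

# Element-wise AND of the two conditions; each side needs its own parentheses
mask = (toy.specialty == "NURSE PRACTITIONER") & (toy.spending < 5000)

n_matching = int(mask.sum())   # True counts as 1
remaining = toy[~mask]         # ~ inverts the mask, keeping the non-matching rows

print(n_matching)
print(remaining)
```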

8. Read the file `data/spending_ch4_practical_2.csv` into the `pandas` DataFrame `spending_df`, with the index column set to `unique_id`

In [0]:
spending_df = pd.read_csv('data/spending_ch4_practical_2.csv', index_col='unique_id')

9. Filter out any specialties that have fewer than 200 records or whose total number of beneficiaries is less than 15,000.

  * Furthermore, save your results as a sorted `DataFrame`. The sort order should be by `specialty` (ascending), `nb_beneficiaries` (descending), and `spending` (descending), respectively.

In [0]:
def filter_spending(x):
  return (x.shape[0] >= 200) and (x.nb_beneficiaries.sum() >= 15000)

spending_by_specialty = spending_df.groupby('specialty')
filtered_spending_df = spending_by_specialty.filter(filter_spending)
filtered_sorted_spending_df = (
    filtered_spending_df.sort_values(by=['specialty', 
                                         'nb_beneficiaries', 
                                         'spending'], 
                                     ascending=[True, False, False]))



  * How many specialties pass this filtering?

In [0]:
filtered_sorted_spending_df.specialty.unique().shape[0]
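
The group-level filter can be checked by hand on a small made-up frame. In this sketch the thresholds are lowered to 3 rows and 10 beneficiaries so that the effect is visible; `toy`, `keep_group` and both thresholds are hypothetical and are not the values used in the task.

```python
import pandas as pd

# Hypothetical data: A passes both tests, B has too few rows, C has too few beneficiaries
toy = pd.DataFrame({
    "specialty": ["A", "A", "A", "B", "B", "C", "C", "C"],
    "nb_beneficiaries": [5, 6, 7, 50, 60, 1, 1, 1],
})

def keep_group(g, min_rows=3, min_beneficiaries=10):
    # Called once per group; must return a single True/False for the whole group
    return bool((len(g) >= min_rows) and
                (g.nb_beneficiaries.sum() >= min_beneficiaries))

# filter() returns the original rows of every group for which keep_group is True
kept = toy.groupby("specialty").filter(keep_group)

print(kept)
```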

10. We covered the code below in this module. Do you remember what it does?

```python 

def my_function(x):
    return (x   / x.sum() ) * 100
    
spending_by_specialty = spending_df.groupby('specialty')
spending_df["spending_pct"] = spending_by_specialty['spending'].transform(my_function)

medication_spending_pct = spending_df.groupby(["specialty", "medication"])["spending_pct"].sum().reset_index()
```

  * Copy and paste the code into a cell. Run the cell and print the first five rows of the `medication_spending_pct` `DataFrame` using the method `head`.

In [0]:
def my_function(x):
    return (x   / x.sum() ) * 100

spending_by_specialty = spending_df.groupby('specialty')
spending_df["spending_pct"] = spending_by_specialty['spending'].transform(my_function)

medication_spending_pct = spending_df.groupby(["specialty", "medication"])["spending_pct"].sum().reset_index()

medication_spending_pct.head(n=5)
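
As a reminder of what the code does: `transform` expresses each row's spending as a percentage of the total spending of its specialty, and the second `groupby` adds those percentages up per medication within each specialty. The sketch below replays both steps on made-up numbers; `toy`, `pct_of_group` and `per_medication` are hypothetical names.

```python
import pandas as pd

# Hypothetical spending records
toy = pd.DataFrame({
    "specialty": ["A", "A", "B", "B"],
    "medication": ["m1", "m2", "m1", "m1"],
    "spending": [30.0, 70.0, 10.0, 40.0],
})

def pct_of_group(x):
    # x holds the spending values of one specialty
    return x / x.sum() * 100

# transform() returns a result aligned with the original rows, one value per row
toy["spending_pct"] = toy.groupby("specialty")["spending"].transform(pct_of_group)

# Sum the row-level percentages per (specialty, medication) pair
per_medication = (toy.groupby(["specialty", "medication"])["spending_pct"]
                  .sum()
                  .reset_index())

print(per_medication)
```

Within each specialty the percentages add up to 100 (up to floating-point rounding), which is a quick sanity check on the result.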

11. Group `medication_spending_pct` by specialty and keep only the specialties for which the sum of `spending_pct` for the top 2 medicines is at least 80. For instance, the sum of `spending_pct` for the highest 2 entries of `"ADDICTION MEDICINE"` is 88.89 + 8.98 = 97.87; therefore, we should retain this specialty. However, the sum of the top 2 medicines in `"ALLERGY/IMMUNOLOGY"` is 41.89 + 8.14 = 50.03; therefore, we should discard this specialty.

In [0]:
def filter_medication(x):
    # Sum the two largest spending_pct values within the specialty
    return x.spending_pct.nlargest(2).sum() >= 80

medication_spending_by_specialty = medication_spending_pct.groupby('specialty')
filtered_medication_spending = medication_spending_by_specialty.filter(filter_medication)

 * Print only the top two entries of each specialty in the resulting `DataFrame`. 

In [0]:
filtered_medication_spending.groupby('specialty')['spending_pct'].nlargest(n=2)
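
Both steps of this task can be verified on a small made-up frame. In this sketch (`toy` and `top2_at_least_80` are hypothetical), specialty A has a top-2 sum of 95 and is kept, while specialty B has a top-2 sum of 75 and is dropped.

```python
import pandas as pd

# Hypothetical per-medication percentages for two specialties
toy = pd.DataFrame({
    "specialty": ["A", "A", "A", "B", "B", "B"],
    "medication": ["m1", "m2", "m3", "m1", "m2", "m3"],
    "spending_pct": [85.0, 10.0, 5.0, 40.0, 35.0, 25.0],
})

def top2_at_least_80(g):
    # Keep the specialty only if its two largest shares add up to 80 or more
    return bool(g.spending_pct.nlargest(2).sum() >= 80)

kept = toy.groupby("specialty").filter(top2_at_least_80)

# Per-group nlargest returns a Series indexed by (specialty, original row label)
top2 = kept.groupby("specialty")["spending_pct"].nlargest(2)

print(top2)
```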