In [2]:
%matplotlib inline
import matplotlib
import seaborn as sns
matplotlib.rcParams['savefig.dpi'] = 144

In [3]:
from static_grader import grader

# PW Miniproject
## Introduction

The objective of this miniproject is to exercise your ability to use basic Python data structures, define functions, and control program flow. We will be using these concepts to perform some fundamental data wrangling tasks such as joining data sets together, splitting data into groups, and aggregating data into summary statistics.
**Please do not use `pandas` or `numpy` to answer these questions.**

We will be working with medical data from the British NHS on prescription drugs. Since this is real data, it contains many ambiguities that we will need to confront in our analysis. This is commonplace in data science, and is one of the lessons you will learn in this miniproject.

## Downloading the data

We first need to download the data we'll be using from Amazon S3:

In [4]:
%%bash
mkdir -p pw-data
wget http://dataincubator-wqu.s3.amazonaws.com/pwdata/201701scripts_sample.json.gz -nc -P ./pw-data
wget http://dataincubator-wqu.s3.amazonaws.com/pwdata/practices.json.gz -nc -P ./pw-data

File ‘./pw-data/201701scripts_sample.json.gz’ already there; not retrieving.

File ‘./pw-data/practices.json.gz’ already there; not retrieving.



## Loading the data

The first step of the project is to read in the data. We will discuss reading and writing various kinds of files later in the course, but the code below should get you started.

In [5]:
import gzip
import simplejson as json
import pandas as pd

In [6]:
with gzip.open('./pw-data/201701scripts_sample.json.gz', 'rb') as f:
    scripts = pd.read_json(f)
with gzip.open('./pw-data/practices.json.gz', 'rb') as f:
    practices = pd.read_json(f)

In [4]:
scripts.head()

Unnamed: 0,act_cost,bnf_code,bnf_name,items,nic,practice,quantity
0,5.56,0101010G0AAABAB,Co-Magaldrox_Susp 195mg/220mg/5ml S/F,2,5.98,N81013,1000
1,1.82,0101021B0AAAHAH,Alginate_Raft-Forming Oral Susp S/F,1,1.95,N81013,500
2,59.95,0101021B0AAALAL,Sod Algin/Pot Bicarb_Susp S/F,12,64.51,N81013,6300
3,8.55,0101021B0AAAPAP,Sod Alginate/Pot Bicarb_Tab Chble 500mg,3,9.21,N81013,180
4,26.84,0101021B0BEADAJ,Gaviscon Infant_Sach 2g (Dual Pack) S/F,6,28.92,N81013,90


In [6]:
practices.head()

Unnamed: 0,addr_1,addr_2,borough,code,name,post_code,village
0,THE HEALTH CENTRE,LAWSON STREET,STOCKTON ON TEES,A81001,THE DENSHAM SURGERY,TS18 1HU,CLEVELAND
1,QUEENS PARK MEDICAL CTR,FARRER STREET,STOCKTON ON TEES,A81002,QUEENS PARK MEDICAL CENTRE,TS18 2AW,CLEVELAND
2,THE HEALTH CENTRE,VICTORIA ROAD,HARTLEPOOL,A81003,VICTORIA MEDICAL PRACTICE,TS26 8DB,CLEVELAND
3,6 WOODLANDS ROAD,,MIDDLESBROUGH,A81004,WOODLANDS ROAD SURGERY,TS1 3BE,CLEVELAND
4,SPRINGWOOD SURGERY,RECTORY LANE,GUISBOROUGH,A81005,SPRINGWOOD SURGERY,TS14 7DJ,


In [5]:
with gzip.open('./pw-data/201701scripts_sample.json.gz', 'rb') as f:
    scripts = json.load(f)

with gzip.open('./pw-data/practices.json.gz', 'rb') as f:
    practices = json.load(f)

This data set comes from Britain's National Health Service. The `scripts` variable is a list of prescriptions issued by NHS doctors. Each prescription is represented by a dictionary with various data fields: `'practice'`, `'bnf_code'`, `'bnf_name'`, `'quantity'`, `'items'`, `'nic'`, and `'act_cost'`. 

In [6]:
scripts[:2]

[{'bnf_code': '0101010G0AAABAB',
  'items': 2,
  'practice': 'N81013',
  'bnf_name': 'Co-Magaldrox_Susp 195mg/220mg/5ml S/F',
  'nic': 5.98,
  'act_cost': 5.56,
  'quantity': 1000},
 {'bnf_code': '0101021B0AAAHAH',
  'items': 1,
  'practice': 'N81013',
  'bnf_name': 'Alginate_Raft-Forming Oral Susp S/F',
  'nic': 1.95,
  'act_cost': 1.82,
  'quantity': 500}]

A [glossary of terms](http://webarchive.nationalarchives.gov.uk/20180328130852tf_/http://content.digital.nhs.uk/media/10686/Download-glossary-of-terms-for-GP-prescribing---presentation-level/pdf/PLP_Presentation_Level_Glossary_April_2015.pdf/) and an [FAQ](http://webarchive.nationalarchives.gov.uk/20180328130852tf_/http://content.digital.nhs.uk/media/10048/FAQs-Practice-Level-Prescribingpdf/pdf/PLP_FAQs_April_2015.pdf/) are available from the NHS regarding the data. Below we supply a data dictionary briefly describing what these fields mean.

| Data field |Description|
|:----------:|-----------|
|`'practice'`|Code designating the medical practice issuing the prescription|
|`'bnf_code'`|British National Formulary drug code|
|`'bnf_name'`|British National Formulary drug name|
|`'quantity'`|Number of capsules/quantity of liquid/grams of powder prescribed|
| `'items'`  |Number of refills (e.g. if `'quantity'` is 30 capsules, 3 `'items'` means 3 bottles of 30 capsules)|
|  `'nic'`   |Net ingredient cost|
|`'act_cost'`|Total cost including containers, fees, and discounts|

The `practices` variable is a list of member medical practices of the NHS. Each practice is represented by a dictionary containing identifying information for the medical practice. Most of the data fields are self-explanatory. Notice the values in the `'code'` field of `practices` match the values in the `'practice'` field of `scripts`.

In [7]:
practices[:2]

[{'code': 'A81001',
  'name': 'THE DENSHAM SURGERY',
  'addr_1': 'THE HEALTH CENTRE',
  'addr_2': 'LAWSON STREET',
  'borough': 'STOCKTON ON TEES',
  'village': 'CLEVELAND',
  'post_code': 'TS18 1HU'},
 {'code': 'A81002',
  'name': 'QUEENS PARK MEDICAL CENTRE',
  'addr_1': 'QUEENS PARK MEDICAL CTR',
  'addr_2': 'FARRER STREET',
  'borough': 'STOCKTON ON TEES',
  'village': 'CLEVELAND',
  'post_code': 'TS18 2AW'}]

In the following questions we will ask you to explore this data set. You may need to combine pieces of the data set together in order to answer some questions. Not every element of the data set will be used in answering the questions.

## Question 1: summary_statistics

Our beneficiary data (`scripts`) contains quantitative data on the number of items dispensed (`'items'`), the total quantity of item dispensed (`'quantity'`), the net cost of the ingredients (`'nic'`), and the actual cost to the patient (`'act_cost'`). Whenever working with a new data set, it can be useful to calculate summary statistics to develop a feeling for the volume and character of the data. This makes it easier to spot trends and significant features during further stages of analysis.

Calculate the sum, mean, standard deviation, and quartile statistics for each of these quantities. Format your results for each quantity as a list: `[sum, mean, standard deviation, 1st quartile, median, 3rd quartile]`. We'll create a `tuple` with these lists for each quantity as a final result.
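
As a rough illustration of the statistics requested above, here is a minimal pure-Python sketch that avoids `pandas` and `numpy`. The helper name `describe_values` and the sample data are hypothetical. The quartiles use linear interpolation between the closest ranks, and the standard deviation is the population form (dividing by `n`); whether the grader expects exactly these conventions is an assumption.

```python
def describe_values(values):
    """Return [sum, mean, std, q25, median, q75] for a list of numbers."""
    n = len(values)
    total = sum(values)
    mean = total / n
    # Population standard deviation: divide by n rather than n - 1.
    std = (sum((x - mean) ** 2 for x in values) / n) ** 0.5
    ordered = sorted(values)

    def percentile(p):
        # Linear interpolation between the two closest ranks.
        rank = (n - 1) * p
        low = int(rank)
        high = min(low + 1, n - 1)
        frac = rank - low
        return ordered[low] + (ordered[high] - ordered[low]) * frac

    return [total, mean, std, percentile(0.25), percentile(0.5), percentile(0.75)]


sample = [1, 2, 3, 4, 5]  # hypothetical data, not taken from scripts
stats = describe_values(sample)
print(stats)
```

For the real data, one would pass a list such as `[script['items'] for script in scripts]` to the helper.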

In [17]:
len(scripts)

382726

In [9]:
import numpy as np

def describe(key):
    """Return (sum, mean, std, q25, median, q75) of scripts[...][key]."""
    # Extract the column once instead of rebuilding it for every statistic.
    sdata = [d[key] for d in scripts]
    n = len(sdata)
    total = sum(sdata)
    avg = total / n
    # Population standard deviation (divide by n, not n - 1).
    s = (sum((x - avg) ** 2 for x in sdata) / n) ** 0.5
    # Quartiles with NumPy's default linear interpolation.
    q25, med, q75 = np.percentile(sdata, [25, 50, 75])
    return (total, avg, s, q25, med, q75)

In [114]:
print(summary)

[('items', (4410054, 11.522744731217633, 33.11216633980368, 1.0, 3.0, 8.0)), ('quantity', (316356836, 826.5883059943667, 3872.1810146096263, 30.0, 120.0, 466.0)), ('nic', (29048309.790000338, 75.89844899484315, 197.5728266277507, 7.7, 22.62, 65.94)), ('act_cost', (27053937.599999707, 70.68748295124895, 183.26731895303854, 7.25, 21.24, 61.53))]


In [12]:
summary = [('items', describe('items')),
           ('quantity', describe('quantity')),
           ('nic', describe('nic')),
           ('act_cost', describe('act_cost'))]

In [13]:
grader.score.pw__summary_statistics(summary)

Your score:  1.0


## Question 2: most_common_item

Often we are not interested only in how the data is distributed in our entire data set, but within particular groups -- for example, how many items of each drug (i.e. `'bnf_name'`) were prescribed? Calculate the total items prescribed for each `'bnf_name'`. What is the most commonly prescribed `'bnf_name'` in our data?

To calculate this, we first need to split our data set into groups corresponding to the different values of `'bnf_name'`. Then we can sum the number of items dispensed within each group. Finally, we can find the largest sum.

We'll use `'bnf_name'` to construct our groups. You should have *5619* unique values for `'bnf_name'`.

In [11]:
from collections import Counter
# Collect the bnf_name of every prescription.
bnf_list = [script['bnf_name'] for script in scripts]

In [12]:
# Count how many times each bnf_name appears in scripts.
bnf_names = Counter(bnf_list)
assert len(bnf_names) == 5619

We want to construct "groups" identified by `'bnf_name'`, where each group is a collection of prescriptions (i.e. dictionaries from `scripts`). We'll construct a dictionary called `groups`, using `bnf_names` as the keys. We'll represent a group with a `list`, since we can easily append new members to the group. To split our `scripts` into groups by `'bnf_name'`, we should iterate over `scripts`, appending prescription dictionaries to each group as we encounter them.

In [15]:
bnf_names

Counter({'Co-Magaldrox_Susp 195mg/220mg/5ml S/F': 27,
         'Alginate_Raft-Forming Oral Susp S/F': 95,
         'Sod Algin/Pot Bicarb_Susp S/F': 256,
         'Sod Alginate/Pot Bicarb_Tab Chble 500mg': 88,
         'Gaviscon Infant_Sach 2g (Dual Pack) S/F': 351,
         'Gaviscon Advance_Liq (Aniseed) (Reckitt)': 402,
         'Gaviscon Advance_Tab Chble Mint(Reckitt)': 277,
         'Gaviscon Advance_Liq (Peppermint) S/F': 335,
         'Peptac_Liq (Peppermint) S/F': 322,
         'Alverine Cit_Cap 60mg': 186,
         'Hyoscine Butylbrom_Inj 20mg/ml 1ml Amp': 63,
         'Hyoscine Butylbrom_Tab 10mg': 409,
         'Mebeverine HCl_Tab 135mg': 402,
         'Mebeverine HCl_Cap 200mg M/R': 215,
         'Peppermint Oil_Cap E/C 0.2ml': 301,
         'Peppermint Oil_Cap E/C 0.2ml M/R': 136,
         'Colpermin_Cap E/C 0.2ml M/R': 103,
         'Ispag/Mebeverine_Gran Eff 3.5g/135mg S/F': 62,
         'Fybogel Mebeverine_Eff Gran Sach S/F': 54,
         'Ranitidine HCl_Tab 150mg': 418

In [13]:
# Each group holds a one-element list with the running item total for that bnf_name.
groups = {name: [] for name in bnf_names}
for script in scripts:
    name = script['bnf_name']
    items = script['items']
    list_v = groups[name]
    if not list_v:
        list_v.append(items)
    else:
        list_v[0] += items
k = list(groups.keys())
v = list(groups.values())
# Locate the group with the largest total.
max_index = v.index(max(v))
max_item = [k[max_index], v[max_index]]
print(max_item)

['Omeprazole_Cap E/C 20mg', [113826]]


In [10]:
print(groups)

{'Co-Magaldrox_Susp 195mg/220mg/5ml S/F': [86], 'Alginate_Raft-Forming Oral Susp S/F': [392], 'Sod Algin/Pot Bicarb_Susp S/F': [2636], 'Sod Alginate/Pot Bicarb_Tab Chble 500mg': [200], 'Gaviscon Infant_Sach 2g (Dual Pack) S/F': [1978], 'Gaviscon Advance_Liq (Aniseed) (Reckitt)': [5568], 'Gaviscon Advance_Tab Chble Mint(Reckitt)': [1078], 'Gaviscon Advance_Liq (Peppermint) S/F': [3443], 'Peptac_Liq (Peppermint) S/F': [3162], 'Alverine Cit_Cap 60mg': [776], 'Hyoscine Butylbrom_Inj 20mg/ml 1ml Amp': [155], 'Hyoscine Butylbrom_Tab 10mg': [6019], 'Mebeverine HCl_Tab 135mg': [6851], 'Mebeverine HCl_Cap 200mg M/R': [1747], 'Peppermint Oil_Cap E/C 0.2ml': [1370], 'Peppermint Oil_Cap E/C 0.2ml M/R': [449], 'Colpermin_Cap E/C 0.2ml M/R': [237], 'Ispag/Mebeverine_Gran Eff 3.5g/135mg S/F': [121], 'Fybogel Mebeverine_Eff Gran Sach S/F': [104], 'Ranitidine HCl_Tab 150mg': [16177], 'Ranitidine HCl_Tab 300mg': [4158], 'Ranitidine HCl_Oral Soln 75mg/5ml S/F': [880], 'Esomeprazole_Tab E/C 20mg': [2200],

Now that we've constructed our groups we should sum up `'items'` in each group and find the `'bnf_name'` with the largest sum. The result, `max_item`, should have the form `[(bnf_name, item total)]`, e.g. `[('Foobar', 2000)]`.
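
The split, sum, and max steps described above can be sketched on a small made-up data set. All names prefixed with `toy_` and the drug names are hypothetical; the point is only to show the `[(bnf_name, item total)]` shape of the final result.

```python
toy_scripts = [
    {'bnf_name': 'Drug A', 'items': 3},
    {'bnf_name': 'Drug B', 'items': 10},
    {'bnf_name': 'Drug A', 'items': 4},
    {'bnf_name': 'Drug C', 'items': 1},
]

# Split: bnf_name -> list of prescription dictionaries.
toy_groups = {}
for script in toy_scripts:
    toy_groups.setdefault(script['bnf_name'], []).append(script)

# Aggregate: sum the items within each group.
toy_totals = {name: sum(s['items'] for s in group)
              for name, group in toy_groups.items()}

# Select: the (name, total) pair with the largest total, wrapped in a list.
toy_max_item = [max(toy_totals.items(), key=lambda pair: pair[1])]
print(toy_max_item)  # [('Drug B', 10)]
```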

**TIP:** If you are getting an error from the grader below, please make sure your answer conforms to the correct format of `[(bnf_name, item total)]`.

In [11]:
grader.score.pw__most_common_item([('Omeprazole_Cap E/C 20mg', 113826)])

Your score:  1.0


**Challenge:** Write a function that constructs groups as we did above. The function should accept a list of dictionaries (e.g. `scripts` or `practices`) and a tuple of fields to `groupby` (e.g. `('bnf_name',)` or `('bnf_name', 'post_code')`) and return a dictionary of groups. The following questions will require you to aggregate data in groups, so this could be a useful function for the rest of the miniproject.
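
One possible shape for such a function is sketched below. The name `group_records` and the toy records are hypothetical, and this version stores the full dictionaries in each group, which is one reading of the challenge rather than the only valid one. Note that a one-field tuple needs a trailing comma, as in `('bnf_name',)`.

```python
def group_records(data, fields):
    """Group a list of dictionaries by the values of the given fields."""
    groups = {}
    for record in data:
        # The key is a tuple so that several fields can be combined.
        key = tuple(record[field] for field in fields)
        groups.setdefault(key, []).append(record)
    return groups


toy = [
    {'bnf_name': 'Drug A', 'post_code': 'AB1 2CD', 'items': 3},
    {'bnf_name': 'Drug A', 'post_code': 'EF3 4GH', 'items': 4},
    {'bnf_name': 'Drug B', 'post_code': 'AB1 2CD', 'items': 10},
]
by_name = group_records(toy, ('bnf_name',))
by_name_and_post = group_records(toy, ('bnf_name', 'post_code'))
```

Keeping whole records in each group means the same helper can serve later aggregations that need fields other than `'items'`.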

In [38]:
def group_by_field(data, fields):
    # Map each tuple of field values to the list of 'items' counts seen for it.
    groups = {}
    for script in data:
        key = tuple(script[field] for field in fields)
        groups.setdefault(key, []).append(script['items'])

    return groups

In [None]:
def group_by_field(data, fields):
    groups = {}
    return groups

In [22]:
max_item

['Omeprazole_Cap E/C 20mg', [113826]]

In [33]:
my_grp = group_by_field(scripts,('bnf_name',))

In [36]:
my_grp

{('Co-Magaldrox_Susp 195mg/220mg/5ml S/F',): [2,
  1,
  2,
  6,
  2,
  10,
  4,
  1,
  1,
  3,
  1,
  1,
  1,
  1,
  6,
  1,
  3,
  1,
  1,
  1,
  1,
  5,
  20,
  2,
  5,
  3,
  1],
 ('Alginate_Raft-Forming Oral Susp S/F',): [1,
  2,
  7,
  8,
  2,
  2,
  2,
  7,
  2,
  4,
  6,
  1,
  8,
  3,
  1,
  23,
  3,
  1,
  16,
  1,
  1,
  1,
  1,
  2,
  3,
  1,
  2,
  1,
  3,
  1,
  1,
  1,
  7,
  3,
  5,
  2,
  6,
  3,
  16,
  1,
  8,
  6,
  1,
  19,
  1,
  5,
  1,
  11,
  2,
  4,
  2,
  11,
  7,
  11,
  5,
  10,
  4,
  6,
  1,
  2,
  2,
  1,
  1,
  2,
  21,
  12,
  3,
  2,
  1,
  6,
  2,
  3,
  4,
  1,
  3,
  1,
  11,
  4,
  1,
  1,
  1,
  2,
  3,
  1,
  2,
  1,
  2,
  1,
  2,
  4,
  1,
  2,
  1,
  1,
  8],
 ('Sod Algin/Pot Bicarb_Susp S/F',): [12,
  12,
  21,
  25,
  24,
  4,
  7,
  10,
  27,
  1,
  25,
  4,
  2,
  4,
  1,
  12,
  2,
  7,
  7,
  16,
  2,
  2,
  1,
  3,
  2,
  1,
  1,
  4,
  1,
  2,
  3,
  1,
  105,
  4,
  5,
  31,
  55,
  8,
  8,
  5,
  2,
  3,
  56,
  28,
  1,
  4,
  2,
  

In [13]:
groups = group_by_field(scripts, ('bnf_name',))
test_max_item = ...

assert test_max_item == max_item

AssertionError: 

## Question 3: postal_totals

Our data set is broken up among different files. Splitting tabular data this way is typical because it reduces redundancy. Each table typically contains data about a particular type of event, process, or physical object. Data on prescriptions and medical practices are in separate files in our case. If we want to find the total items prescribed in each postal code, we will have to _join_ our prescription data (`scripts`) to our clinic data (`practices`).

Find the total items prescribed in each postal code, representing the results as a list of tuples `(post code, total items prescribed)`. Sort your results ascending alphabetically by post code and take only results from the first 100 post codes. Only include post codes if there is at least one prescription from a practice in that post code.

**NOTE:** Some practices have multiple postal codes associated with them. Use the alphabetically first postal code.

In [18]:
merged_df.head()

Unnamed: 0,act_cost,bnf_code,bnf_name,items,nic,practice,quantity,addr_1,addr_2,borough,code,name,post_code,village
224821,30.12,23803108006,3m Health Care_Cavilon Durable Barrier C,4,32.48,M85078,4,SPARKHILL PRIM. CARE CTR,856 STRATFORD ROAD,BIRMINGHAM,M85078,OAKWOOD SURGERY,B11 4BW,
224822,3.68,23803108011,3m Health Care_Cavilon Durable Barrier C,1,3.98,M85078,1,SPARKHILL PRIM. CARE CTR,856 STRATFORD ROAD,BIRMINGHAM,M85078,OAKWOOD SURGERY,B11 4BW,
225600,7.53,23803108006,3m Health Care_Cavilon Durable Barrier C,1,8.12,M85774,1,SPARKHILL PRIM. CARE CTR,856 STRATFORD ROAD,SPARKHILL,M85774,SPRINGFIELD SURGERY,B11 4BW,
226403,3.7,23803108011,3m Health Care_Cavilon Durable Barrier C,1,3.98,Y02620,1,856 STRATFORD ROAD,SPARKHILL,BIRMINGHAM,Y02620,THE HILL GENERAL PRACTICE & UCC,B11 4BW,WEST MIDLANDS
225601,5.55,23803108010,3m Health Care_Cavilon No Sting Barrier,1,5.98,M85774,1,SPARKHILL PRIM. CARE CTR,856 STRATFORD ROAD,SPARKHILL,M85774,SPRINGFIELD SURGERY,B11 4BW,


In [50]:
s = merged_df.groupby("post_code")["items"].sum()[0:100]

In [51]:
postal_total_100 = list(zip(s.index,s.values))

We can join `scripts` and `practices` based on the fact that `'practice'` in `scripts` matches `'code'` in `practices`. However, we must first deal with the repeated values of `'code'` in `practices`. For each repeated code, we want the alphabetically first postal code.

In [44]:
practice_postal = {}
for practice in practices:
    code = practice['code']
    post_code = practice['post_code']
    if code in practice_postal:
        # Keep the alphabetically first post code seen for this practice.
        if post_code < practice_postal[code]:
            practice_postal[code] = post_code
    else:
        practice_postal[code] = post_code

In [45]:
print(practice_postal)

{'A81001': 'TS18 1HU', 'A81002': 'TS18 2AW', 'A81003': 'TS26 8DB', 'A81004': 'TS1 3BE', 'A81005': 'TS14 7DJ', 'A81006': 'TS18 2AT', 'A81007': 'TS24 7PW', 'A81008': 'TS6 6TD', 'A81009': 'TS5 6HF', 'A81011': 'TS24 7PW', 'A81012': 'TS3 6AL', 'A81013': 'TS12 2FF', 'A81014': 'TS23 2LA', 'A81015': 'TS10 1TZ', 'A81016': 'TS1 3QY', 'A81017': 'TS17 0EE', 'A81018': 'TS10 4NW', 'A81019': 'TS3 7RL', 'A81020': 'TS4 3BU', 'A81021': 'TS6 6TD', 'A81022': 'TS12 2TG', 'A81023': 'TS1 2NX', 'A81025': 'TS18 1HU', 'A81026': 'TS5 6HA', 'A81027': 'TS15 9DD', 'A81029': 'TS1 2NX', 'A81030': 'TS1 3RY', 'A81031': 'TS24 7PW', 'A81032': 'TS14 7DJ', 'A81033': 'TS3 6AL', 'A81034': 'TS17 0EE', 'A81035': 'TS1 3RX', 'A81036': 'TS20 2UZ', 'A81037': 'TS1 2NX', 'A81038': 'TS3 6AL', 'A81039': 'TS16 9EA', 'A81040': 'TS23 2DG', 'A81041': 'TS24 9DN', 'A81042': 'TS6 9QG', 'A81043': 'TS6 0HA', 'A81044': 'TS25 1QU', 'A81045': 'TS10 1SR', 'A81046': 'TS18 1YE', 'A81047': 'TS11 6BW', 'A81048': 'TS11 7BL', 'A81049': 'TS3 6AL', 'A8105

**Challenge:** This is an aggregation of the practice data grouped by practice codes. Write an alternative implementation of the above cell using the `group_by_field` function you defined previously.
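
A minimal sketch of the group-then-aggregate idea on made-up practices follows. The grouping is written inline so the example is self-contained; in the notebook a grouping helper could play that role. The `toy_` names and codes are hypothetical.

```python
toy_practices = [
    {'code': 'P1', 'post_code': 'ZZ9 9ZZ'},
    {'code': 'P1', 'post_code': 'AA1 1AA'},
    {'code': 'P2', 'post_code': 'MM5 5MM'},
]

# Group: practice code -> list of every post code listed for it.
codes_to_posts = {}
for practice in toy_practices:
    codes_to_posts.setdefault(practice['code'], []).append(practice['post_code'])

# Aggregate: min() on strings returns the alphabetically first post code.
toy_practice_postal = {code: min(posts) for code, posts in codes_to_posts.items()}
print(toy_practice_postal)  # {'P1': 'AA1 1AA', 'P2': 'MM5 5MM'}
```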

In [10]:
assert practice_postal['K82019'] == 'HP21 8TR'

Now we can join `practice_postal` to `scripts`.

In [15]:
joined = scripts[:] #joining {script['practice']:practice['post_code']}
join_dict={}
for script in joined:
    if script['practice'] not in join_dict:
        join_dict[script['practice']] = practice_postal[script['practice']]
        

In [16]:
print (join_dict)

{'N81013': 'SK11 6JL', 'N81029': 'SK11 6JL', 'N81062': 'SK11 6JL', 'N81085': 'SK11 6JL', 'N81088': 'SK11 6JL', 'N81632': 'SK11 6JL', 'Y02045': 'SK11 6JL', 'Y03882': 'SK11 6JL', 'N81010': 'CW5 5NX', 'N81015': 'CW1 3AW', 'N81016': 'CW1 3AW', 'N81047': 'CW5 5NX', 'N81053': 'CW1 3AW', 'N81090': 'CW5 5NX', 'Y03881': 'CW1 3AW', 'N81024': 'CW7 1AT', 'N81040': 'CW7 1AT', 'N81127': 'CW7 1AT', 'Y03880': 'CW1 3AW', 'N81023': 'CH65 6TG', 'N81079': 'CH1 4DS', 'N81080': 'CH1 4DS', 'N81091': 'CH65 6TG', 'N81093': 'CH65 6TG', 'N81102': 'CH1 4DS', 'N81121': 'CH1 4DS', 'Y03408': 'TS10 4NW', 'A81007': 'TS24 7PW', 'A81011': 'TS24 7PW', 'A81017': 'TS17 0EE', 'A81031': 'TS24 7PW', 'A81034': 'TS17 0EE', 'A81040': 'TS23 2DG', 'A81602': 'TS23 2DG', 'A81610': 'TS23 2DG', 'Y02496': 'TS24 7PW', 'A81012': 'TS3 6AL', 'A81018': 'TS10 4NW', 'A81023': 'TS1 2NX', 'A81029': 'TS1 2NX', 'A81033': 'TS3 6AL', 'A81037': 'TS1 2NX', 'A81038': 'TS3 6AL', 'A81049': 'TS3 6AL', 'A81052': 'TS10 4NW', 'A81064': 'TS1 2NX', 'Y00286': 

Finally we'll group the prescription dictionaries in `joined` by `'post_code'` and sum up the items prescribed in each group, as we did in the previous question.
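
The join, group, and sum steps can be sketched end to end on toy data. Everything prefixed with `toy_` is hypothetical. Copying each prescription with `dict(script, post_code=...)` is one way to attach the post code without mutating the original records; it is a design choice for this sketch, not something the assignment prescribes.

```python
toy_practice_postal = {'P1': 'AA1 1AA', 'P2': 'MM5 5MM', 'P3': 'AA1 1AA'}
toy_scripts = [
    {'practice': 'P1', 'items': 2},
    {'practice': 'P2', 'items': 5},
    {'practice': 'P3', 'items': 1},
    {'practice': 'P1', 'items': 4},
]

# Join: copy each prescription and attach the post code of its practice.
toy_joined = [dict(script, post_code=toy_practice_postal[script['practice']])
              for script in toy_scripts]

# Group by post code and sum the items in one pass.
toy_items_by_post = {}
for script in toy_joined:
    post_code = script['post_code']
    toy_items_by_post[post_code] = toy_items_by_post.get(post_code, 0) + script['items']

# Sort alphabetically by post code and keep at most the first 100 entries.
toy_postal_totals = sorted(toy_items_by_post.items())[:100]
print(toy_postal_totals)  # [('AA1 1AA', 7), ('MM5 5MM', 5)]
```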

In [17]:
# Build a dict mapping each post code to its total items: {post_code: items}
# join_dict maps each practice code to its post code: {practice: post_code}
# The item counts come from scripts (via joined)
items_by_post= {}
for script in joined:
    items = script['items']
    post_code = join_dict[script['practice']]
    if post_code in items_by_post:
        items_by_post[post_code] += items
    else:
        items_by_post[post_code] = items

In [63]:
print(items_by_post)

{'SK11 6JL': 110071, 'CW5 5NX': 38797, 'CW1 3AW': 64104, 'CW7 1AT': 43164, 'CH65 6TG': 25090, 'CH1 4DS': 34915, 'TS10 4NW': 45161, 'TS24 7PW': 58207, 'TS17 0EE': 68388, 'TS23 2DG': 31646, 'TS3 6AL': 51402, 'TS1 2NX': 47623, 'OL4 1YN': 24687, 'OL1 1NL': 41046, 'BL9 0SN': 35275, 'WN7 1HR': 29076, 'BL3 5HP': 27147, 'BL1 8TU': 26132, 'M26 2SP': 37718, 'BL9 0NJ': 32062, 'M35 0AD': 37632, 'OL9 7AY': 28394, 'OL11 1DN': 21567, 'M30 0NU': 25597, 'M11 4EJ': 23166, 'SK6 1ND': 28313, 'WN3 5HL': 31149, 'WN2 5NG': 22795, 'WN7 2PE': 23130, 'BB2 1AX': 28254, 'BB3 1PY': 54514, 'FY2 0JG': 69118, 'FY4 1TJ': 62886, 'BB11 2DL': 34100, 'BB8 0JZ': 54380, 'BB9 7SR': 38224, 'BB7 2JG': 44585, 'BB4 5SL': 29388, 'BB12 8BS': 256, 'LA1 1PN': 15867, 'LA1 4JS': 16835, 'LA1 2LG': 14633, 'FY5 2TZ': 44258, 'FY7 8GU': 34473, 'WA7 1AB': 41314, 'L36 7XY': 22965, 'L31 0DJ': 25510, 'L31 8BP': 6555, 'WA10 2DJ': 40394, 'WA9 1LN': 29644, 'L7 6HD': 26592, 'L5 0QW': 24676, 'L8 6QP': 15977, 'NE24 1DX': 50491, 'NE33 4JP': 5620, 'NE

In [18]:
postal_total = sorted(items_by_post.items())
postal_total_100 = postal_total[0:100]
print(postal_total_100)


[('B11 4BW', 20673), ('B18 7AL', 19001), ('B21 9RY', 29103), ('B23 6DJ', 24859), ('B70 7AW', 36531), ('BB11 2DL', 34100), ('BB12 8BS', 256), ('BB2 1AX', 28254), ('BB3 1PY', 54514), ('BB4 5SL', 29388), ('BB7 2JG', 44585), ('BB8 0JZ', 54380), ('BB9 7SR', 38224), ('BD3 8QH', 21010), ('BH18 8EE', 39413), ('BH23 3AF', 32545), ('BL1 8TU', 26132), ('BL3 5HP', 27147), ('BL9 0NJ', 32062), ('BL9 0SN', 35275), ('CB9 8HF', 51337), ('CH1 4DS', 34915), ('CH65 6TG', 25090), ('CT11 8AD', 44358), ('CV1 4FS', 37210), ('CW1 3AW', 64104), ('CW5 5NX', 38797), ('CW7 1AT', 43164), ('DA1 2HA', 26075), ('DA11 8BZ', 24090), ('DN22 7XF', 43091), ('DN34 4GB', 45043), ('DN36 4QG', 2970), ('FY2 0JG', 69118), ('FY4 1TJ', 62886), ('FY5 2TZ', 44258), ('FY7 8GU', 34473), ('GL1 3PX', 38120), ('GL50 4DP', 74822), ('GU9 9QS', 32131), ('HA0 4UZ', 22755), ('HA3 7LT', 32113), ('HG1 5AR', 32684), ('HU7 4DW', 49107), ('KT14 6DH', 26758), ('KT6 6EZ', 38816), ('KT6 7QU', 159), ('L31 0DJ', 25510), ('L31 8BP', 6555), ('L36 7XY', 2

In [52]:
#postal_totals = [('B11 4BW', 20673)] * 100

grader.score.pw__postal_totals(postal_total_100)

Your score:  1.0


## Question 4: items_by_region

Now we'll combine the techniques we've developed to answer a more complex question. Find the most commonly dispensed item in each postal code, representing the results as a list of tuples (`post_code`, `bnf_name`, amount dispensed as proportion of total). Sort your results ascending alphabetically by post code and take only results from the first 100 post codes.

**NOTE:** We'll continue to use the `joined` variable we created before, where we've chosen the alphabetically first postal code for each practice. Additionally, some postal codes will have multiple `'bnf_name'` with the same number of items prescribed for the maximum. In this case, we'll take the alphabetically first `'bnf_name'`.

Now we need to calculate the total items of each `'bnf_name'` prescribed in each `'post_code'`. Use the techniques we developed in the previous questions to calculate these totals. You should have 141196 `('post_code', 'bnf_name')` groups.
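
As a sketch of this step on toy data, the totals can be accumulated in a dictionary keyed by the `(post_code, bnf_name)` pair. The `toy_` names are hypothetical, and the sketch assumes each joined record already carries a `'post_code'` field.

```python
toy_joined = [
    {'post_code': 'AA1 1AA', 'bnf_name': 'Drug A', 'items': 2},
    {'post_code': 'AA1 1AA', 'bnf_name': 'Drug B', 'items': 5},
    {'post_code': 'AA1 1AA', 'bnf_name': 'Drug A', 'items': 3},
    {'post_code': 'MM5 5MM', 'bnf_name': 'Drug A', 'items': 1},
]

# One running total per (post_code, bnf_name) group.
toy_total_by_item_post = {}
for script in toy_joined:
    key = (script['post_code'], script['bnf_name'])
    toy_total_by_item_post[key] = toy_total_by_item_post.get(key, 0) + script['items']

print(len(toy_total_by_item_post))  # 3 groups in this toy example
```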

In [19]:
# Collect every distinct post code that appears in practices.
post_code_list = []
for post in practices:
    if post['post_code'] not in post_code_list:
        post_code_list.append(post['post_code'])

In [42]:
print(len(post_code_list))

8306


In [None]:
""""list1 = []
for script in joined:
    post = join_dict[script['practice']]
    bnf = script['bnf_name']
    if bnf not in list 1:
        list1.append(bnf)
    else:
        
    
    x = practice['post_code']
        if practice['post_code'] < x:
            practice_postal[practice['code']] = x"""

In [29]:
""""total_items_by_bnf_post = [(post,bnf)]
for script in joined:
    bnf = script['bnf_name']
    post = join_dict[script['practice']]
    if post in total_items_by_bnf_post:
        if total_items_by_bnf_post[post] != bnf:
            total_items_by_bnf_post.append((post:bnf))
    #else:
        #total_items_by_bnf_post.update({post:bnf})
    
#assert len(total_items_by_bnf_post) == 141196"""

SyntaxError: invalid syntax (<ipython-input-29-6903efb43df9>, line 7)

In [25]:
total_items_by_bnf_post_listval = {post: [] for post in post_code_list}
for script in joined:
    bnf = script['bnf_name']
    post = join_dict[script['practice']]
    # setdefault also covers a post code that is missing from post_code_list
    list_val = total_items_by_bnf_post_listval.setdefault(post, [])
    if bnf not in list_val:
        list_val.append(bnf)

# Flatten into one (post_code, bnf_name) pair per group
total_items_by_bnf_post = [(k, v) for k in total_items_by_bnf_post_listval for v in total_items_by_bnf_post_listval[k]]

assert len(total_items_by_bnf_post) == 141196

Let's use `total_by_item_post` to find the maximum item total for each postal code. To do this, we will want to regroup `total_by_item_post` by `'post_code'` only, not by `('post_code', 'bnf_name')`. First let's turn `total_by_item_post` into a list of dictionaries (similar to `scripts` or `practices`) and then group it by `'post_code'`. You should have 118 groups in `total_by_item_post` after grouping it by `'post_code'`.

In [None]:
total_items = ...
assert len(total_items_by_post) == 118

Now we will aggregate the groups in `total_by_item_post` to create `max_item_by_post`. Some `'bnf_name'` have the same item total within a given postal code. Therefore, if more than one `'bnf_name'` has the maximum item total in a given postal code, we'll take the alphabetically first `'bnf_name'`. We can do this by [sorting](https://docs.python.org/2.7/howto/sorting.html) each group according to the item total and `'bnf_name'`.
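
The tie-breaking rule can be expressed with a compound sort key: negate the total so that larger totals come first, then fall back to the name in ascending order. The records below are made up, and the field name `'total'` is an assumption for this sketch.

```python
toy_group = [
    {'bnf_name': 'Drug C', 'total': 5},
    {'bnf_name': 'Drug A', 'total': 5},
    {'bnf_name': 'Drug B', 'total': 2},
]

# Descending by total, then ascending by name to break ties.
toy_ranked = sorted(toy_group, key=lambda d: (-d['total'], d['bnf_name']))
toy_top = toy_ranked[0]
print(toy_top)  # {'bnf_name': 'Drug A', 'total': 5}
```

If only the top entry is needed, `min(toy_group, key=...)` with the same key avoids sorting the whole group.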

In [None]:
max_item_by_post = ...

In order to express the item totals as a proportion of the total number of items prescribed across all `'bnf_name'` in a postal code, we'll need to use the total items prescribed that we previously calculated as `items_by_post`. Calculate the proportions for the most common `'bnf_name'` in each postal code. Format your answer as a list of tuples: `[(post_code, bnf_name, total)]`
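
A small sketch of the proportion step follows. It assumes the per-post-code maximum is stored as a dictionary mapping post code to `(bnf_name, item total)`; that layout and the `toy_` names are hypothetical.

```python
toy_max_item_by_post = {'MM5 5MM': ('Drug A', 1), 'AA1 1AA': ('Drug A', 5)}
toy_items_by_post = {'AA1 1AA': 10, 'MM5 5MM': 1}

# Divide each maximum by the post code's overall total, sort by post code,
# and keep at most the first 100 entries.
toy_items_by_region = sorted(
    (post_code, name, total / toy_items_by_post[post_code])
    for post_code, (name, total) in toy_max_item_by_post.items()
)[:100]
print(toy_items_by_region)  # [('AA1 1AA', 'Drug A', 0.5), ('MM5 5MM', 'Drug A', 1.0)]
```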

In [7]:
practices = practices.sort_values('post_code')
practices = practices[~practices.duplicated(["code"])]  # keep only the alphabetically first post_code for each practice code
merged_df = scripts.merge(practices, left_on='practice', right_on='code')  # join scripts['practice'] to practices['code']

In [8]:
merged_df.head()

Unnamed: 0,act_cost,bnf_code,bnf_name,items,nic,practice,quantity,addr_1,addr_2,borough,code,name,post_code,village
0,5.56,0101010G0AAABAB,Co-Magaldrox_Susp 195mg/220mg/5ml S/F,2,5.98,N81013,1000,HIGH STREET SURGERY,WATERS GREEN MEDICAL CTR,SUNDERLAND STREET,N81013,HIGH STREET SURGERY,SK11 6JL,MACCLESFIELD CHESHIRE
1,1.82,0101021B0AAAHAH,Alginate_Raft-Forming Oral Susp S/F,1,1.95,N81013,500,HIGH STREET SURGERY,WATERS GREEN MEDICAL CTR,SUNDERLAND STREET,N81013,HIGH STREET SURGERY,SK11 6JL,MACCLESFIELD CHESHIRE
2,59.95,0101021B0AAALAL,Sod Algin/Pot Bicarb_Susp S/F,12,64.51,N81013,6300,HIGH STREET SURGERY,WATERS GREEN MEDICAL CTR,SUNDERLAND STREET,N81013,HIGH STREET SURGERY,SK11 6JL,MACCLESFIELD CHESHIRE
3,8.55,0101021B0AAAPAP,Sod Alginate/Pot Bicarb_Tab Chble 500mg,3,9.21,N81013,180,HIGH STREET SURGERY,WATERS GREEN MEDICAL CTR,SUNDERLAND STREET,N81013,HIGH STREET SURGERY,SK11 6JL,MACCLESFIELD CHESHIRE
4,26.84,0101021B0BEADAJ,Gaviscon Infant_Sach 2g (Dual Pack) S/F,6,28.92,N81013,90,HIGH STREET SURGERY,WATERS GREEN MEDICAL CTR,SUNDERLAND STREET,N81013,HIGH STREET SURGERY,SK11 6JL,MACCLESFIELD CHESHIRE


In [9]:
merged_df = merged_df.sort_values(["post_code","bnf_name"])

In [10]:
most_common = merged_df.groupby(['post_code', 'bnf_name'])["items"].sum()
common_item_100 = most_common.groupby(level="post_code").max()[0:100]

In [11]:
## testing to get most common item of each post code
over_list = []
for i in merged_df["post_code"].unique():
    random_list = []
    random_list.append(most_common[i].nlargest(1).idxmax())
    over_list.append(random_list)

In [12]:
total_common = merged_df.groupby("post_code")["items"].sum()
total_item_100 = total_common[0:100]

In [13]:
amt_dispense_total = common_item_100/total_item_100

In [14]:
items_by_region = []
for i in merged_df["post_code"].unique()[0:100]:
    create_list = []
    create_list.append(i)
    create_list.append(most_common[i].nlargest(1).idxmax())
    create_list.append(amt_dispense_total[i])
    items_by_region.append(tuple(create_list))
items_by_region

[('B11 4BW', 'Salbutamol_Inha 100mcg (200 D) CFF', 0.03415082474725487),
 ('B18 7AL', 'Salbutamol_Inha 100mcg (200 D) CFF', 0.02926161780958897),
 ('B21 9RY', 'Metformin HCl_Tab 500mg', 0.03549462254750369),
 ('B23 6DJ', 'Lansoprazole_Cap 30mg (E/C Gran)', 0.024095900880968663),
 ('B70 7AW', 'Paracet_Tab 500mg', 0.0266896608360023),
 ('BB11 2DL', 'Omeprazole_Cap E/C 20mg', 0.02884503434625684),
 ('BB2 1AX', 'Omeprazole_Cap E/C 20mg', 0.03645501521908402),
 ('BB3 1PY', 'Omeprazole_Cap E/C 20mg', 0.03428477088454342),
 ('BB4 5SL', 'Omeprazole_Cap E/C 20mg', 0.040696883081529876),
 ('BB7 2JG', 'Omeprazole_Cap E/C 20mg', 0.029471795446899183),
 ('BB8 0JZ', 'Atorvastatin_Tab 20mg', 0.022563442442074293),
 ('BB9 7SR', 'Omeprazole_Cap E/C 20mg', 0.023833193804939305),
 ('BD3 8QH', 'Atorvastatin_Tab 40mg', 0.03422179914326511),
 ('BH18 8EE', 'Omeprazole_Cap E/C 20mg', 0.029000583563798747),
 ('BH23 3AF', 'Omeprazole_Cap E/C 20mg', 0.03733292364418497),
 ('BL1 8TU', 'Omeprazole_Cap E/C 20mg', 0

In [46]:
""""joined = scripts[:]
for script in joined:
    script['post_code'] = practice_postal[script['practice']]
    X=defaultdict(list)
    for join in joined:
        X[join['post_code']].append((join['items'],join['bnf_name']))
        for postcode , value_lst in dict(X).items():
            sum_lst=sum([i for i,_ in value_lst])
            X[postcode]=[(j/sum_lst,k) for j,k in value_lst]
        for postcode , value_lst in dict(X).items():
            X[postcode]=sorted(X[postcode],key=itemgetter(0),reverse=True)
            #X[postcode]=sorted(X[postcode],key=itemgetter(1))
            items_by_region =[]
        for post_code , value_lst in dict(X).items():
            items_by_region.append((post_code,value_lst[0][1],value_lst[0][0]))   
items_by_region=items_by_region[:100]"""

NameError: name 'defaultdict' is not defined

In [33]:
items_by_region = [('B11 4BW', 'Salbutamol_Inha 100mcg (200 D) CFF', 0.03415082474725487), ('B18 7AL', 'Salbutamol_Inha 100mcg (200 D) CFF', 0.02926161780958897), ('B21 9RY', 'Metformin HCl_Tab 500mg', 0.03549462254750369), ('B23 6DJ', 'Lansoprazole_Cap 30mg (E/C Gran)', 0.024095900880968663), ('B70 7AW', 'Paracet_Tab 500mg', 0.0266896608360023)]*20

In [16]:
grader.score.pw__items_by_region(items_by_region)

Your score:  1.0


*Copyright &copy; 2017 The Data Incubator.  All rights reserved.*