### <font color="brown">Working with CSV Datasets</font>

In [6]:
import csv

#### Example 3: Processing auto-mpg CSV file Using DictReader

In [7]:
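# using DictReader on a CSV file; print the first five rows
with open('auto_mpg_original.csv', newline='') as csvfile:   # the with block closes the file
    reader = csv.DictReader(csvfile)
    for index,row in enumerate(reader):
        print(row)
        if index > 3:
            break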

OrderedDict([('mpg', '18.0'), ('cylinders', '8.'), ('displacement', '307.0'), ('horsepower', '130.0'), ('weight', '3504.'), ('acceleration', '12.0'), ('model year', '70.'), ('origin', '1.'), ('car name', 'chevrolet chevelle malibu')])
OrderedDict([('mpg', '15.0'), ('cylinders', '8.'), ('displacement', '350.0'), ('horsepower', '165.0'), ('weight', '3693.'), ('acceleration', '11.5'), ('model year', '70.'), ('origin', '1.'), ('car name', 'buick skylark 320')])
OrderedDict([('mpg', '18.0'), ('cylinders', '8.'), ('displacement', '318.0'), ('horsepower', '150.0'), ('weight', '3436.'), ('acceleration', '11.0'), ('model year', '70.'), ('origin', '1.'), ('car name', 'plymouth satellite')])
OrderedDict([('mpg', '16.0'), ('cylinders', '8.'), ('displacement', '304.0'), ('horsepower', '150.0'), ('weight', '3433.'), ('acceleration', '12.0'), ('model year', '70.'), ('origin', '1.'), ('car name', 'amc rebel sst')])
OrderedDict([('mpg', '17.0'), ('cylinders', '8.'), ('displacement', '302.0'), ('horsepo

**Print header column names, and all lines that have an NA for any of the fields**

In [8]:
# using the fieldnames attribute and the values() method
with open('auto_mpg_original.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    print(reader.fieldnames)
    print(','.join(reader.fieldnames))
    for row in reader:
        values = list(row.values())  # row.values() is a view, so convert it to a list
        if 'NA' in values:
            values[-1] = '"' + values[-1] + '"'  # put quotes around the car name
            print(','.join(values))

['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model year', 'origin', 'car name']
mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name
NA,4.,133.0,115.0,3090.,17.5,70.,2.,"citroen ds-21 pallas"
NA,8.,350.0,165.0,4142.,11.5,70.,1.,"chevrolet chevelle concours (sw)"
NA,8.,351.0,153.0,4034.,11.0,70.,1.,"ford torino (sw)"
NA,8.,383.0,175.0,4166.,10.5,70.,1.,"plymouth satellite (sw)"
NA,8.,360.0,175.0,3850.,11.0,70.,1.,"amc rebel sst (sw)"
NA,8.,302.0,140.0,3353.,8.0,70.,1.,"ford mustang boss 302"
25.0,4.,98.00,NA,2046.,19.0,71.,1.,"ford pinto"
NA,4.,97.00,48.00,1978.,20.0,71.,2.,"volkswagen super beetle 117"
21.0,6.,200.0,NA,2875.,17.0,74.,1.,"ford maverick"
40.9,4.,85.00,NA,1835.,17.3,80.,2.,"renault lecar deluxe"
23.6,4.,140.0,NA,2905.,14.3,80.,1.,"ford mustang cobra"
34.5,4.,100.0,NA,2320.,15.8,81.,2.,"renault 18i"
NA,4.,121.0,110.0,2800.,15.4,81.,2.,"saab 900s"
23.0,4.,151.0,NA,3035.,20.5,82.,1.,"amc concord dl"


**Write out a cleaned up version into a CSV file**

In [97]:
with open('auto_mpg_original.csv', newline='') as csvin, open('auto_mpg.csv','w') as csvfile:
    reader = csv.DictReader(csvin)                      # input, has NA for some values
    csvfile.write(','.join(reader.fieldnames)+'\n')     # header line with field names
    for row in reader:
        if 'NA' in row.values():                        # skip lines with NA for any value
            continue
        csvfile.write(','.join(row.values())+'\n')


**Alternatively, you can use a csv.DictWriter to write the output**

In [96]:
with open('auto_mpg_original.csv') as csvfile: 
    reader = csv.DictReader(csvfile)
    
    with open('auto_mpg.csv','w',newline='') as csvout:
        # fieldnames is a required parameter for DictWriter
        # note: delimiter='\t' makes the output tab-separated; omit it for comma-separated output
        writer = csv.DictWriter(csvout,fieldnames=reader.fieldnames, delimiter='\t')  
        writer.writeheader()   
        for row in reader:
            if 'NA' in row.values():
                continue
            writer.writerow(row)  
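
The same filter can be tried without touching any files. The following is a minimal sketch, assuming a small in-memory sample in place of auto_mpg_original.csv; the third data row (`example car, deluxe`) is hypothetical and is included only to show that the csv writer quotes a field containing a comma, which the plain `','.join(...)` approach would not do.

```python
import csv
import io

# Small in-memory sample in the same layout as the CSV file;
# the "example car, deluxe" row is made up for illustration.
sample = (
    "mpg,cylinders,car name\n"
    "18.0,8.,chevrolet chevelle malibu\n"
    "NA,4.,citroen ds-21 pallas\n"
    '16.0,8.,"example car, deluxe"\n'
)

reader = csv.DictReader(io.StringIO(sample))
out = io.StringIO()  # stands in for the output file
writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
writer.writeheader()

kept = 0
for row in reader:
    if "NA" in row.values():   # drop rows with NA in any field
        continue
    writer.writerow(row)
    kept += 1

cleaned = out.getvalue()
print(cleaned)
```

With a real file, `io.StringIO` would be replaced by `open(...)` calls as in the cell above.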

---

### <font color="brown">Working with JSON (JavaScript Object Notation) Datasets</font>

In [14]:
import json

---

#### <font color="brown">Loading a JSON-formatted string into a Python object</font>

In [16]:
json1 = '{"hill center":"Busch", "AB":"College Ave"}'   # a string containing dictionary formatted data
# load this into Python
dict1 = json.loads(json1)
print(dict1)
print(dict1.keys())
print(dict1.values())

{'hill center': 'Busch', 'AB': 'College Ave'}
dict_keys(['hill center', 'AB'])
dict_values(['Busch', 'College Ave'])


In [18]:
json1 = {"hill center":"Busch", "AB":"College Ave"}   
# load this into Python
dict1 = json.loads(json1)
print(dict1)

TypeError: the JSON object must be str, bytes or bytearray, not dict

**The above doesn't work: the input must be a string, so the whole thing needs quotes around it**

In [19]:
json1 = '{2:"Busch", 1:"College Ave"}'  
# load this into Python
dict1 = json.loads(json1)
print(dict1)

JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)

**Keys are required to be strings, so the numbers 2 and 1 as keys are rejected**

**But values are not required to be strings**

In [20]:
json1 = '{"John":12, "Jane":25}'   # but values need not be strings
# load this into Python
dict1 = json.loads(json1)
print(dict1)

{'John': 12, 'Jane': 25}


In [21]:
x =  '{ "name":"John", "age":30, "city":"New York"}'
y = json.loads(x)
print(y)
print(y["age"])

{'name': 'John', 'age': 30, 'city': 'New York'}
30


**Key strings are required to be double-quoted**

In [22]:
x =  "{ 'name':'John', 'age':30, 'city':'New York'}"
y = json.loads(x)

JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 3 (char 2)

**The above doesn't work, because JSON strings, including keys, are required to be double-quoted**
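
When the input comes from somewhere you do not control, it may help to catch the error instead of letting the cell fail. The sketch below assumes a small helper named `try_load` (a hypothetical name, not part of the json module) that returns either the parsed value or the error message.

```python
import json

def try_load(text):
    """Return (value, None) on success, or (None, error message) on failure."""
    try:
        return json.loads(text), None
    except json.JSONDecodeError as err:
        return None, err.msg

single_quoted = "{ 'name':'John', 'age':30 }"   # not valid JSON
double_quoted = '{ "name":"John", "age":30 }'   # valid JSON

print(try_load(single_quoted))
print(try_load(double_quoted))
```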

---

#### <font color="brown">Dumping a dictionary to a JSON-formatted string</font>

In [23]:
dat_dict = { 'name' : 'Jane', 'age' : 25, 'city' : 'Chicago'}
dat_str = json.dumps(dat_dict)
print(dat_str)

{"name": "Jane", "age": 25, "city": "Chicago"}


In [24]:
# a dictionary with integers for keys
dict2 = {2: 'busch', 1: 'college ave'}
print(dict2)

{2: 'busch', 1: 'college ave'}


In [26]:
# dump to string 
dict2_str = json.dumps(dict2)
print(dict2_str)  

{"2": "busch", "1": "college ave"}


**<font color="red">1. When dumping, integer keys are converted to strings, and single-quoted strings become double-quoted</font>**

In [28]:
dict2_new = json.loads(dict2_str)
print(dict2_new)   

{'2': 'busch', '1': 'college ave'}


**<font color="red">2. So when loading back, the dict keys are strings, and dict2 is NOT the same as dict2_new</font>**
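
If the integer keys are needed again after loading, they have to be converted back explicitly. Here is a minimal sketch of two ways to do that, assuming every key in the JSON text really is an integer written as a string; `int_keys` is a hypothetical helper name. The `object_pairs_hook` parameter of `json.loads` is called for every JSON object in the input, so the second option would also convert keys of nested objects.

```python
import json

dict2 = {2: 'busch', 1: 'college ave'}
dict2_str = json.dumps(dict2)          # keys become the strings "2" and "1"

# Option 1: convert the keys after loading
restored = {int(k): v for k, v in json.loads(dict2_str).items()}

# Option 2: convert while loading; the hook receives a list of (key, value) pairs
def int_keys(pairs):
    return {int(k): v for k, v in pairs}

restored_hook = json.loads(dict2_str, object_pairs_hook=int_keys)

print(restored == dict2, restored_hook == dict2)
```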

---

#### <font color="brown">Using arrays as values</font>

In [34]:
# array of integers
json3 = '{"name": "Anika", "quiz_scores":[38,40,36,40,32]}'
dict3 = json.loads(json3)
print(dict3['quiz_scores'][2])

36


In [35]:
# array of dictionaries
json4 = '{"quiz_scores" : [{"name": "Anika", "scores": [38,40,36,40,32]}, {"name": "Amir", "scores":[36,38,40,30,34]}]}'
dict4 = json.loads(json4)
print(dict4)

{'quiz_scores': [{'name': 'Anika', 'scores': [38, 40, 36, 40, 32]}, {'name': 'Amir', 'scores': [36, 38, 40, 30, 34]}]}


In [36]:
print(dict4['quiz_scores'][1]['name'])  # name of second item in quiz_scores value array
print(dict4['quiz_scores'][0]['scores'][3])  # 4th score of first item in quiz_scores value array

Amir
40
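
Once the nested structure is loaded, ordinary Python tools work on it. As an illustration only, this sketch computes the average score per student from the same JSON text; the name `averages` is introduced here and is not in the cells above.

```python
import json

json4 = '{"quiz_scores" : [{"name": "Anika", "scores": [38,40,36,40,32]}, {"name": "Amir", "scores":[36,38,40,30,34]}]}'
dict4 = json.loads(json4)

# map each student's name to the mean of their scores
averages = {
    item['name']: sum(item['scores']) / len(item['scores'])
    for item in dict4['quiz_scores']
}
print(averages)
```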


---

#### <font color="brown">Storing JSON to file</font>

In [37]:
# dump to file
with open ("quiz_scores.json","w") as qsfile:
    json.dump(dict4, qsfile)

In [38]:
# load from file
with open("quiz_scores.json") as qsfile:
    qs_scores = json.load(qsfile)

In [39]:
print(qs_scores)

{'quiz_scores': [{'name': 'Anika', 'scores': [38, 40, 36, 40, 32]}, {'name': 'Amir', 'scores': [36, 38, 40, 30, 34]}]}
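
By default `json.dump` writes everything on one line. The sketch below shows the `indent` parameter, which makes the file easier to read in a text editor. It writes into a temporary directory (an assumption made here so that no existing file is overwritten) and uses a shortened copy of the data.

```python
import json
import os
import tempfile

scores = {"quiz_scores": [{"name": "Anika", "scores": [38, 40, 36, 40, 32]}]}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "quiz_scores.json")
    with open(path, "w") as qsfile:
        json.dump(scores, qsfile, indent=2)   # indent=2 pretty-prints the output
    with open(path) as qsfile:
        text = qsfile.read()                  # raw file contents
    with open(path) as qsfile:
        reloaded = json.load(qsfile)          # parsed back into Python objects

print(text)
print(reloaded == scores)
```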


---

#### <font color="brown">JSON with just a string (no dictionary)</font>

In [40]:
# string must be double-quoted
jsonstr = json.loads('"JSON - JavaScript Object Notation"')
jsonstr

'JSON - JavaScript Object Notation'

---

#### <font color="brown">JSON with just an array</font>

In [42]:
jsonarr = json.loads('[1,2,2,4]')
print(jsonarr)
print(len(jsonarr))

[1, 2, 2, 4]
4


---

#### <font color="brown">JSON with just a number</font>

In [43]:
jsonint = json.loads('25')
print(type(jsonint))
jsonreal = json.loads('25.3')
print(type(jsonreal))

<class 'int'>
<class 'float'>


In [44]:
json.loads('12.x')

JSONDecodeError: Extra data: line 1 column 3 (char 2)

In [45]:
json.loads('"12.x"')   # this is a string

'12.x'

---

#### <font color="brown">JSON with just a boolean</font>

In [47]:
jsonbool = json.loads('true')   # must be lowercase
print(jsonbool)
print(type(jsonbool))

True
<class 'bool'>


---

#### <font color="brown">JSON with a null</font>

In [49]:
jsonnull = json.loads('null')
print(jsonnull)
print(type(jsonnull))

None
<class 'NoneType'>
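
The cases above can be summarized in one place. This sketch loads a single JSON object (made up for illustration) that contains one value of each kind and prints the Python type that `json.loads` produces for it.

```python
import json

doc = '{"s": "text", "i": 25, "f": 25.3, "b": true, "n": null, "a": [1, 2], "o": {"k": "v"}}'
parsed = json.loads(doc)

# collect the Python type name for each JSON value
type_names = {key: type(value).__name__ for key, value in parsed.items()}
for key, name in type_names.items():
    print(key, name)
```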


---

#### Exercise: converting an ad hoc storage format to JSON

Suppose scores were in a file *qs_scores.txt*, like this:
    
Anika Sorenson|38,40,36,40,32<br>
Amir Sharif|36,38,40,30,34

We want to store this in JSON form so that it is in a standard format.

In [55]:
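# read the input text file qs_scores.txt (it must already contain the two lines shown above)
qs_dict = {}
with open('qs_scores.txt') as infile:
    for line in infile:
        flds = line.split('|')                     # name | comma-separated scores
        scores = flds[1].strip().split(',')
        qs_scores = [int(qs) for qs in scores]     # convert score strings to integers
        qs_dict[flds[0].strip()] = qs_scores
print(qs_dict)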

{'Anika Sorenson': [38, 40, 36, 40, 32], 'Amir Sharif': [36, 38, 40, 30, 34]}


In [56]:
with open('qs_scores.json','w') as qsfile:
    json.dump(qs_dict, qsfile)

# double-click the output file, will open in json interpretation mode
# right-click -> open with editor, can see plain text
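
The parsing step can also be written as a function and tried on in-memory lines, without needing the text file. This is a sketch under that assumption; `parse_scores` is a hypothetical name. Unlike the loop above, it skips blank lines, which would otherwise raise an IndexError when the line has no `|`.

```python
import json

def parse_scores(lines):
    """Turn 'name|s1,s2,...' lines into {name: [int scores]}; blank lines are skipped."""
    result = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, scores = line.split('|', 1)
        result[name.strip()] = [int(s) for s in scores.split(',')]
    return result

sample_lines = [
    "Anika Sorenson|38,40,36,40,32\n",
    "Amir Sharif|36,38,40,30,34\n",
    "\n",                                  # trailing blank line
]
qs_dict = parse_scores(sample_lines)
print(json.dumps(qs_dict, indent=2))
```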

---

#### <font color="brown">Getting JSON data from a Web page</font>

In [57]:
import requests

#### Example of reading public JSON dataset

Nobel Prizes - http://api.nobelprize.org/v1/prize.json

In [58]:
nobel_url = 'http://api.nobelprize.org/v1/prize.json'
resp = requests.get(nobel_url)
nobels = json.loads(resp.text)

**nobels is a dictionary with a single key, 'prizes'**

In [61]:
print(nobels.keys())

dict_keys(['prizes'])


**the value for 'prizes' is a list**

In [62]:
len(nobels['prizes'])

658

**the list has length 658, one item per prize**

In [64]:
print(nobels['prizes'][0])

{'year': '2021', 'category': 'chemistry', 'laureates': [{'id': '1002', 'firstname': 'Benjamin', 'surname': 'List', 'motivation': '"for the development of asymmetric organocatalysis"', 'share': '2'}, {'id': '1003', 'firstname': 'David', 'surname': 'MacMillan', 'motivation': '"for the development of asymmetric organocatalysis"', 'share': '2'}]}


**each list item is a dictionary**

**<font color="brown">Get all prizes awarded in the year 2021</font>**

In [66]:
nobels_2021 = [prize for prize in nobels['prizes'] if prize['year'] == '2021']
print(nobels_2021)

[{'year': '2021', 'category': 'chemistry', 'laureates': [{'id': '1002', 'firstname': 'Benjamin', 'surname': 'List', 'motivation': '"for the development of asymmetric organocatalysis"', 'share': '2'}, {'id': '1003', 'firstname': 'David', 'surname': 'MacMillan', 'motivation': '"for the development of asymmetric organocatalysis"', 'share': '2'}]}, {'year': '2021', 'category': 'economics', 'laureates': [{'id': '1007', 'firstname': 'David', 'surname': 'Card', 'motivation': '"for his empirical contributions to labour economics"', 'share': '2'}, {'id': '1008', 'firstname': 'Joshua', 'surname': 'Angrist', 'motivation': '"for their methodological contributions to the analysis of causal relationships"', 'share': '4'}, {'id': '1009', 'firstname': 'Guido', 'surname': 'Imbens', 'motivation': '"for their methodological contributions to the analysis of causal relationships"', 'share': '4'}]}, {'year': '2021', 'category': 'literature', 'laureates': [{'id': '1004', 'firstname': 'Abdulrazak', 'surname':

**This is too much information and quite hard to read; we want to print it in a user-friendly format**

**We want:**<br>
    Chemistry: name1, name2 ...<br>
    Economics: name1, name2 ...


In [67]:
for prize in nobels_2021:
    print(prize['category'].capitalize() + ': ',end='')
    winners = [winner['firstname']+' '+winner['surname'] for winner in prize['laureates']]
    print(', '.join(winners))

Chemistry: Benjamin List, David MacMillan
Economics: David Card, Joshua Angrist, Guido Imbens
Literature: Abdulrazak Gurnah
Peace: Maria Ressa, Dmitry Muratov
Physics: Syukuro Manabe, Klaus Hasselmann, Giorgio Parisi
Medicine: David Julius, Ardem Patapoutian


**<font color="brown">Get all prizes awarded in the year 2020</font>**

In [68]:
nobels_2020 = [prize for prize in nobels['prizes'] if prize['year'] == '2020']
for prize in nobels_2020:
    print(prize['category'].capitalize() + ': ',end='')
    winners = [winner['firstname']+' '+winner['surname'] for winner in prize['laureates']]
    print(', '.join(winners))

Chemistry: Emmanuelle Charpentier, Jennifer A. Doudna
Economics: Paul Milgrom, Robert Wilson
Literature: Louise Glück
Peace: 

KeyError: 'surname'

**The surname is missing for the Peace prize, and could be missing in other years as well.<br>
Use the dict get method with a default return value of the empty string if the key is not found**

In [69]:
for prize in nobels_2020:
    print(prize['category'].capitalize() + ': ',end='')
    winners = [winner['firstname']+' '+winner.get('surname','') for winner in prize['laureates']]
    print(', '.join(winners))

Chemistry: Emmanuelle Charpentier, Jennifer A. Doudna
Economics: Paul Milgrom, Robert Wilson
Literature: Louise Glück
Peace: World Food Programme 
Physics: Roger Penrose, Reinhard Genzel, Andrea Ghez
Medicine: Harvey Alter, Michael Houghton, Charles Rice
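
The formatting logic can be wrapped in a small function and checked against hand-made data, with no network access. The sketch below assumes two sample entries that only mimic the shape of the API response, and a hypothetical helper `winner_names`. It strips the trailing space left behind when the surname is missing, and it also tolerates an entry without a 'laureates' key, which is an assumption about the data rather than something shown above.

```python
def winner_names(prize):
    """Return the full names of a prize's laureates, tolerating missing fields."""
    names = []
    for winner in prize.get('laureates', []):
        full = (winner.get('firstname', '') + ' ' + winner.get('surname', '')).strip()
        names.append(full)
    return names

# hand-made entries in the same shape as the items in nobels['prizes']
sample_prizes = [
    {'year': '2020', 'category': 'literature',
     'laureates': [{'firstname': 'Louise', 'surname': 'Glück'}]},
    {'year': '2020', 'category': 'peace',
     'laureates': [{'firstname': 'World Food Programme'}]},   # no surname key
]

lines = [p['category'].capitalize() + ': ' + ', '.join(winner_names(p))
         for p in sample_prizes]
print('\n'.join(lines))
```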


---

#### <font color="brown">Basic JSON structure: https://www.json.org/json-en.html</font>
As the description at the top of that page says, JSON is built on two structures. In Python terminology, these are dictionaries (key-value pairs) and lists (arrays).
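
Because everything is built from these two structures, loaded JSON data can be explored with a short recursive function. The following is a sketch only; `describe` is a hypothetical helper that reports each dictionary, each list, and the type of every other value, indented by nesting depth.

```python
import json

def describe(value, depth=0):
    """Return indented lines describing the nesting of dicts and lists."""
    pad = '  ' * depth
    if isinstance(value, dict):
        lines = [pad + 'dict with keys ' + str(list(value))]
        for child in value.values():
            lines.extend(describe(child, depth + 1))
        return lines
    if isinstance(value, list):
        lines = [pad + 'list of length ' + str(len(value))]
        for child in value:
            lines.extend(describe(child, depth + 1))
        return lines
    return [pad + type(value).__name__]   # str, int, float, bool or NoneType

data = json.loads('{"name": "Anika", "scores": [38, 40]}')
print('\n'.join(describe(data)))
```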