## CS 210 Spring 2024 - Feb 26
### Working with CSV Datasets, JSON files

In [1]:
import csv

### The UCI Iris Dataset - Continued

**1. Make sure there are exactly 5 fields in each row (Done on Feb 19)**

**2. Make sure all fields except last are real numbers**

In [2]:
with open('iris-messy.csv') as irisfile:
    reader = csv.reader(irisfile)
    next(reader)                           # skip first line of field names
    
    for num,row in enumerate(reader):
        if len(row) != 5:                  # lines that have too many or too few fields
            print(f'Row {(num+1):03}:',end='') 
            print(' Too few fields' if len(row) < 5 else ' Too many fields')
            print('\t',row,'\n')
        else:
            for val in row[:-1]:           # skip last field
                try:
                    float(val)
                except ValueError:         # float() raises ValueError for non-numeric text
                    print(f"Row {(num+1):03}: Non-numeric value '{val}'")
                    print('\t',row,'\n')

Row 009: Too many fields
	 ['4.4', '2', '9', '1.4', '0.2', 'Iris-setosa'] 

Row 013: Non-numeric value 'N/A'
	 ['4.8', 'N/A', '1.4', '0.1', 'Iris-setosa'] 

Row 035: Non-numeric value 'n/a'
	 ['4.9', '3.1', 'n/a', '0.1', 'Iris-setosa'] 

Row 036: Non-numeric value 'na'
	 ['5.0', 'na', '1.2', '0.2', 'Iris-setosa'] 

Row 043: Non-numeric value '?'
	 ['?', '3.2', '1.3', '0.2', 'Iris-setosa'] 

Row 064: Too few fields
	 ['6.1', '4.7', '1.4', 'Iris-versicolor'] 

Row 070: Non-numeric value 'NA'
	 ['5.6', '2.5', '3.9', 'NA', 'Iris-versicolor'] 

Row 077: Non-numeric value '?'
	 ['6.8', '2.8', '?', '1.4', 'Iris-versicolor'] 

Row 078: Too many fields
	 ['6.7', '3.0', '4.5', '1.7', '6.5', 'Iris-versicolor'] 

Row 103: Too many fields
	 ['7', '1', '3.0', '5.9', '2.1', 'Iris-virginica'] 

Row 113: Too few fields
	 ['6.8', '3.0', '5.5', '2.1'] 

Row 127: Non-numeric value '4x8'
	 ['6.2', '2.8', '4x8', '1.8', 'Iris-virginica'] 

Row 137: Non-numeric value '?'
	 ['6.3', '3.4', '5.6', '?', 'Iris-vir

**3. Finalize by writing out acceptable lines:**
- Skip lines that have too few or too many fields
- Replace non-numeric field with NA (standardize)

In [3]:
with open('iris-better.csv','w') as outfile:
    with open('iris-messy.csv') as irisfile:
        
        reader = csv.reader(irisfile)
        
        row = next(reader)                # read first line of field names
        outfile.write(','.join(row))
        outfile.write('\n')
    
        for num,row in enumerate(reader):
            if len(row) != 5:             # skip lines that have too many or too few fields
                continue
            
            outrow = []
            for val in row[:-1]:      # check all fields except last for numeric
                try:
                    float(val)
                    outrow.append(val)
                except ValueError:    # not a number, standardize to NA
                    outrow.append('NA')

            outrow.append(row[-1])    # last field, non-numeric string
            outfile.write(','.join(outrow))
            outfile.write('\n')

**Alternatively, you can use a CSV writer to write out**

In [20]:
with open('iris-better.csv','w',newline='') as csvfile:  # note the newline='' parameter
    writer = csv.writer(csvfile, delimiter=',')          # set outfile column delimiter to comma, which is the default
   
    with open('iris-messy.csv') as irisfile:
        
        reader = csv.reader(irisfile)
        
        row = next(reader)                               # first line of column names
        writer.writerow(row)                             # use writerow method of writer with list of columns as param
    
        for num,row in enumerate(reader):
            if len(row) != 5:                            # lines that have too many or too few columns
                continue
            
            outrow = []
            for val in row[:-1]:                         # check all fields except last for numeric
                try:
                    float(val)
                    outrow.append(val)
                except ValueError:                       # not a number, standardize to NA
                    outrow.append('NA')
            outrow.append(row[-1])                      # last field, non-numeric string
            writer.writerow(outrow)
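
The per-row logic that both versions above share can be pulled into a small helper. This is only a sketch: the names `clean_row` and `N_FIELDS` are made up for illustration and are not part of the lecture code.

```python
N_FIELDS = 5  # assumed layout: four measurements plus the species label

def clean_row(row):
    """Return a cleaned copy of row, or None if the field count is wrong."""
    if len(row) != N_FIELDS:
        return None
    cleaned = []
    for val in row[:-1]:              # every field except the last must be numeric
        try:
            float(val)
            cleaned.append(val)
        except ValueError:
            cleaned.append('NA')      # standardize anything non-numeric
    cleaned.append(row[-1])           # species label is kept as-is
    return cleaned

print(clean_row(['4.8', 'N/A', '1.4', '0.1', 'Iris-setosa']))
print(clean_row(['6.8', '3.0', '5.5', '2.1']))
```

One caveat with this check: `float()` also accepts strings such as `'nan'` and `'inf'`, so those would pass as numeric.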

#### Example 3: Processing auto-mpg CSV file Using DictReader

In [6]:
# Using DictReader on csv
# Each row comes back as a dictionary (a plain dict in Python 3.8+, an OrderedDict in 3.6 and 3.7)
reader = csv.DictReader(open('auto_mpg_original.csv'))
for index,row in enumerate(reader):
    print(row)
    if index > 3:
        break

{'mpg': '18.0', 'cylinders': '8.', 'displacement': '307.0', 'horsepower': '130.0', 'weight': '3504.', 'acceleration': '12.0', 'model year': '70.', 'origin': '1.', 'car name': 'chevrolet chevelle malibu'}
{'mpg': '15.0', 'cylinders': '8.', 'displacement': '350.0', 'horsepower': '165.0', 'weight': '3693.', 'acceleration': '11.5', 'model year': '70.', 'origin': '1.', 'car name': 'buick skylark 320'}
{'mpg': '18.0', 'cylinders': '8.', 'displacement': '318.0', 'horsepower': '150.0', 'weight': '3436.', 'acceleration': '11.0', 'model year': '70.', 'origin': '1.', 'car name': 'plymouth satellite'}
{'mpg': '16.0', 'cylinders': '8.', 'displacement': '304.0', 'horsepower': '150.0', 'weight': '3433.', 'acceleration': '12.0', 'model year': '70.', 'origin': '1.', 'car name': 'amc rebel sst'}
{'mpg': '17.0', 'cylinders': '8.', 'displacement': '302.0', 'horsepower': '140.0', 'weight': '3449.', 'acceleration': '10.5', 'model year': '70.', 'origin': '1.', 'car name': 'ford torino'}


*Note that the double quotes around the car name have been stripped off*
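
A quick way to see this without the data file is to read a small in-memory sample. The two-line `sample` string below is hand-written for illustration in the same layout as the auto-mpg file; it is not the real dataset.

```python
import csv
import io

# Hypothetical two-line sample, same layout idea as auto_mpg_original.csv
sample = 'mpg,car name\n18.0,"chevrolet chevelle malibu"\n'

rows = list(csv.DictReader(io.StringIO(sample)))
print(rows[0]['car name'])     # the quotes are not part of the value
print(type(rows[0]['mpg']))    # every field is read as a string

mpg = float(rows[0]['mpg'])    # convert explicitly when a number is needed
print(mpg)
```

So the quotes are only part of the CSV encoding, and numeric columns need an explicit conversion before any arithmetic.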

**Print header column names, and all lines that have an NA for any of the fields**

In [8]:
# using the fieldnames attribute and the values() method
reader = csv.DictReader(open('auto_mpg_original.csv'))
print(reader.fieldnames)
print(','.join(reader.fieldnames))
for row in reader:
    values = list(row.values())  # row.values() is a view object, convert to list so it can be indexed and modified
    if 'NA' in values:
        values[-1] = '"' + values[-1] + '"'
        print(','.join(values))

['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model year', 'origin', 'car name']
mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name
NA,4.,133.0,115.0,3090.,17.5,70.,2.,"citroen ds-21 pallas"
NA,8.,350.0,165.0,4142.,11.5,70.,1.,"chevrolet chevelle concours (sw)"
NA,8.,351.0,153.0,4034.,11.0,70.,1.,"ford torino (sw)"
NA,8.,383.0,175.0,4166.,10.5,70.,1.,"plymouth satellite (sw)"
NA,8.,360.0,175.0,3850.,11.0,70.,1.,"amc rebel sst (sw)"
NA,8.,302.0,140.0,3353.,8.0,70.,1.,"ford mustang boss 302"
25.0,4.,98.00,NA,2046.,19.0,71.,1.,"ford pinto"
NA,4.,97.00,48.00,1978.,20.0,71.,2.,"volkswagen super beetle 117"
21.0,6.,200.0,NA,2875.,17.0,74.,1.,"ford maverick"
40.9,4.,85.00,NA,1835.,17.3,80.,2.,"renault lecar deluxe"
23.6,4.,140.0,NA,2905.,14.3,80.,1.,"ford mustang cobra"
34.5,4.,100.0,NA,2320.,15.8,81.,2.,"renault 18i"
NA,4.,121.0,110.0,2800.,15.4,81.,2.,"saab 900s"
23.0,4.,151.0,NA,3035.,20.5,82.,1.,"amc concord dl"


**Write out a cleaned up version into a CSV file**

In [9]:
reader = csv.DictReader(open('auto_mpg_original.csv'))  # input, has a bunch of NAs for values
with open('auto_mpg.csv','w') as csvfile:               # output, delete lines with NA for any value
    csvfile.write(','.join(reader.fieldnames)+'\n')     # header line with field names    
    for row in reader:
        if 'NA' in row.values():
            continue
        values = list(row.values())
        csvfile.write(','.join(values)+'\n')


**Alternatively, you can use a CSV DictWriter writer to write out**

In [10]:
with open('auto_mpg_original.csv') as csvfile: 
    reader = csv.DictReader(csvfile)
    
    with open('auto_mpg.csv','w',newline='') as csvout:
        # fieldnames is a required parameter for DictWriter
        # delimiter='\t' makes the output tab-separated; omit it to keep the default comma
        writer = csv.DictWriter(csvout,fieldnames=reader.fieldnames, delimiter='\t')  
        writer.writeheader()   
        for row in reader:
            if 'NA' in row.values():
                continue
            writer.writerow(row)  
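
To get comma-separated output from `DictWriter`, the `delimiter` argument can simply be left out. Here is a minimal in-memory sketch of the same filter; the three-line `sample` is hand-written for illustration, and `io.StringIO` stands in for the real files.

```python
import csv
import io

# Hypothetical sample: one clean row and one row with an NA
sample = ('mpg,horsepower,car name\n'
          '18.0,130.0,"chevrolet chevelle malibu"\n'
          '25.0,NA,"ford pinto"\n')

reader = csv.DictReader(io.StringIO(sample))
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=reader.fieldnames)  # default delimiter is a comma
writer.writeheader()
for row in reader:
    if 'NA' in row.values():    # drop rows with any missing value
        continue
    writer.writerow(row)

result = out.getvalue()
print(result)
```

By default the writer quotes a field only when it has to (for example, when the field contains a comma), so the car name comes out unquoted here.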

---

### <font color="brown">Working with JSON (JavaScript Object Notation) Datasets</font>

In [2]:
import json

---

#### <font color="brown">Loading a JSON-formatted string into a Python object</font>

In [3]:
json1 = '{"hill center":"Busch", "AB":"College Ave"}'   # a string containing dictionary formatted data
# load this into Python
dict1 = json.loads(json1)
print(dict1)
print(dict1.keys())
print(dict1.values())

{'hill center': 'Busch', 'AB': 'College Ave'}
dict_keys(['hill center', 'AB'])
dict_values(['Busch', 'College Ave'])


In [4]:
json1 = {"hill center":"Busch", "AB":"College Ave"}   
# load this into Python
dict1 = json.loads(json1)
print(dict1)

TypeError: the JSON object must be str, bytes or bytearray, not dict

**Above doesn't work: the input to json.loads must be a string (or bytes), so the whole thing needs quotes around it**

In [5]:
json1 = '{2:"Busch", 1:"College Ave"}'  
# load this into Python
dict1 = json.loads(json1)
print(dict1)

JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)

**Keys are required to be strings, so the numbers 2 and 1 as keys are rejected**

**But values are not required to be strings**

In [6]:
json1 = '{"John":12, "Jane":25}'   # but values need not be strings
# load this into Python
dict1 = json.loads(json1)
print(dict1)

{'John': 12, 'Jane': 25}


In [7]:
x =  '{ "name":"John", "age":30, "city":"New York"}'
y = json.loads(x)
print(y)
print(y["age"])

{'name': 'John', 'age': 30, 'city': 'New York'}
30


**Key strings are required to be double-quoted**

In [8]:
x =  "{ 'name':'John', 'age':30, 'city':'New York'}"
y = json.loads(x)

JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 3 (char 2)

**Above doesn't work, because JSON strings (keys as well as string values) are required to be double-quoted**
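
When the input comes from somewhere you don't control, it can help to catch the error instead of letting the cell fail. A minimal sketch; the helper name `try_parse` is made up for illustration.

```python
import json

def try_parse(text):
    """Return the parsed value, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        # err.msg is the reason, err.pos the character offset of the problem
        print(f'Invalid JSON: {err.msg} (char {err.pos})')
        return None

good = try_parse('{"name": "John", "age": 30}')
bad = try_parse("{'name': 'John', 'age': 30}")   # single quotes are rejected
print(good)
print(bad)
```

Returning `None` on failure is a simplification: the valid JSON text `null` also parses to `None`, so real code may prefer to re-raise or return a separate flag.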

---

#### <font color="brown">Dumping a dictionary to JSON-formatted string</font>

In [9]:
dat_dict = { 'name' : 'Jane', 'age' : 25, 'city' : 'Chicago'}
dat_str = json.dumps(dat_dict)
print(dat_str)

{"name": "Jane", "age": 25, "city": "Chicago"}


In [10]:
# a dictionary with integers for keys
dict2 = {2: 'busch', 1: 'college ave'}
print(dict2)

{2: 'busch', 1: 'college ave'}


In [11]:
# dump to string 
dict2_str = json.dumps(dict2)
print(dict2_str)  

{"2": "busch", "1": "college ave"}


**<font color="red">1. When dumping, integer keys are converted to strings, and single-quoted Python strings are written out double-quoted</font>**

In [12]:
dict2_new = json.loads(dict2_str)
print(dict2_new)   

{'2': 'busch', '1': 'college ave'}


**<font color="red">2. So when loading back, the dict keys come back as strings, so dict2_new is NOT the same as dict2</font>**
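
If you know the keys were integers before dumping, one way to get them back is to convert them after loading. A minimal sketch, assuming every key really is an integer string (otherwise `int()` raises `ValueError`):

```python
import json

dict2 = {2: 'busch', 1: 'college ave'}
loaded = json.loads(json.dumps(dict2))          # keys are now '2' and '1'

# Convert the keys back to int after loading
restored = {int(key): value for key, value in loaded.items()}

print(loaded == dict2)
print(restored == dict2)
```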

---

#### <font color="brown">Using arrays as values</font>

In [13]:
# array of integers
json3 = '{"name": "Anika", "quiz_scores":[38,40,36,40,32]}'
dict3 = json.loads(json3)
print(dict3['quiz_scores'][2])

36


In [14]:
# array of dictionaries
json4 = '{"quiz_scores" : [{"name": "Anika", "scores": [38,40,36,40,32]}, {"name": "Amir", "scores":[36,38,40,30,34]}]}'
dict4 = json.loads(json4)
print(dict4)

{'quiz_scores': [{'name': 'Anika', 'scores': [38, 40, 36, 40, 32]}, {'name': 'Amir', 'scores': [36, 38, 40, 30, 34]}]}


In [15]:
print(dict4['quiz_scores'][1]['name'])  # name of second item in quiz_scores value array
print(dict4['quiz_scores'][0]['scores'][3])  # 4th score of first item in quiz_scores value array

Amir
40


---

#### <font color="brown">Storing JSON to file</font>

In [16]:
# dump to file
with open ("quiz_scores.json","w") as qsfile:
    json.dump(dict4, qsfile)

**If you open the quiz_scores.json file with Editor, it looks like this:**

{"quiz_scores": [{"name": "Anika", "scores": [38, 40, 36, 40, 32]}, {"name": "Amir", "scores": [36, 38, 40, 30, 34]}]}

**But if you just double-click on it (same as open with JSON), it will show you a hierarchical structure that is much easier to comprehend. Try it out for yourself.**
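
You can also get an indented, easier-to-read layout from Python itself by passing `indent` to `json.dumps` (or to `json.dump` when writing a file). A short sketch using the same `dict4` data:

```python
import json

dict4 = {'quiz_scores': [{'name': 'Anika', 'scores': [38, 40, 36, 40, 32]},
                         {'name': 'Amir', 'scores': [36, 38, 40, 30, 34]}]}

pretty = json.dumps(dict4, indent=2)   # indent=2 puts each item on its own line
print(pretty)
```

The indented text is still valid JSON, so loading it back gives the same dictionary.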

In [17]:
# load from file
with open("quiz_scores.json") as qsfile:
    qs_scores = json.load(qsfile)

In [18]:
print(qs_scores)

{'quiz_scores': [{'name': 'Anika', 'scores': [38, 40, 36, 40, 32]}, {'name': 'Amir', 'scores': [36, 38, 40, 30, 34]}]}


---

#### <font color="brown">JSON with just a string (no dictionary)</font>

In [19]:
jsonstr = json.loads('"JSON - JavaScript Object Notation"')
jsonstr

'JSON - JavaScript Object Notation'

**Note that the JSON string itself must be double-quoted; the outer single quotes are just the Python string literal that holds the JSON text**
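
A small sketch to separate the two layers of quoting; the variable names are only for illustration.

```python
import json

# Outer single quotes delimit the Python string; the inner double quotes are the JSON string.
text = json.loads('"JSON - JavaScript Object Notation"')

# json.dumps emits double quotes, whichever quotes the Python literal used.
encoded = json.dumps('it works')
print(encoded)

# A single-quoted string is not valid JSON.
try:
    json.loads("'it works'")
    single_ok = True
except json.JSONDecodeError:
    single_ok = False
print(single_ok)
```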

---

#### <font color="brown">JSON with just an array</font>

In [21]:
jsonarr = json.loads('[1,2,2,4]')
print(jsonarr)
print(len(jsonarr))
print(type(jsonarr))

[1, 2, 2, 4]
4
<class 'list'>


---

#### <font color="brown">JSON with just a number</font>

In [22]:
jsonint = json.loads('25')
print(type(jsonint))
jsonreal = json.loads('25.3')
print(type(jsonreal))

<class 'int'>
<class 'float'>


In [23]:
json.loads('12.x')

JSONDecodeError: Extra data: line 1 column 3 (char 2)

In [24]:
json.loads('"12.x"')   # this is a string

'12.x'

---

#### <font color="brown">JSON with just a boolean</font>

In [25]:
jsonbool = json.loads('true')   # must be lowercase
print(jsonbool)
print(type(jsonbool))

True
<class 'bool'>


---

#### <font color="brown">JSON with a null</font>

In [26]:
jsonnull = json.loads('null')
print(jsonnull)
print(type(jsonnull))

None
<class 'NoneType'>
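
The cells above can be summarized in one loop that shows which Python type each kind of JSON value loads into. The list of sample literals is made up for illustration.

```python
import json

literals = ['{"a": 1}', '[1, 2]', '"text"', '25', '25.3', 'true', 'false', 'null']

# Map each JSON text to the name of the Python type it loads into
types = {lit: type(json.loads(lit)).__name__ for lit in literals}

for lit, name in types.items():
    print(f'{lit:10} -> {name}')
```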


---

#### Exercise: ad hoc file format converted to JSON

Suppose scores were in a file *qs_scores.txt*, like this:
    
Anika Sorenson|38,40,36,40,32<br>
Amir Sharif|36,38,40,30,34

We want to store this in JSON form so that it is standardized

In [28]:
# first make the input text file qs_scores.txt with the two lines shown above, then read it
qs_dict = {}
for line in open('qs_scores.txt'):
    flds = line.split('|')
    scores = flds[1].strip().split(',')
    qs_scores = [int(qs) for qs in scores]
    qs_dict[flds[0].strip()] = qs_scores
print(qs_dict)

{'Anika Sorenson': [38, 40, 36, 40, 32], 'Amir Sharif': [36, 38, 40, 30, 34]}


In [29]:
with open('qs_scores.json','w') as qsfile:
    json.dump(qs_dict, qsfile)

# double-click the output file, will open in json interpretation mode
# right-click -> open with editor, can see plain text
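
The parsing step can also be written as a function that takes any iterable of lines, which makes it easy to try without creating a file. A sketch only: `parse_scores` and `sample_lines` are hypothetical names, and skipping blank lines is an added assumption.

```python
import json

def parse_scores(lines):
    """Build {name: [scores]} from lines shaped like 'name|s1,s2,...'."""
    result = {}
    for line in lines:
        line = line.strip()
        if not line:                              # assumed: ignore blank lines
            continue
        name, _, scores = line.partition('|')     # split at the first '|'
        result[name.strip()] = [int(s) for s in scores.split(',')]
    return result

sample_lines = ['Anika Sorenson|38,40,36,40,32\n',
                'Amir Sharif|36,38,40,30,34\n',
                '\n']
qs_dict = parse_scores(sample_lines)
print(json.dumps(qs_dict))
```

Because an open file is itself an iterable of lines, the same function could be called as `parse_scores(qsfile)` inside a `with open('qs_scores.txt') as qsfile:` block.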

---

#### <font color="brown">Getting JSON data from a Web page</font>

In [30]:
import requests

#### Example of reading public JSON dataset

Nobel Prizes - http://api.nobelprize.org/v1/prize.json

In [39]:
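nobel_url = 'http://api.nobelprize.org/v1/prize.json'
resp = requests.get(nobel_url)
resp.raise_for_status()                  # stop here if the request failed (HTTP 4xx/5xx)
nobels = json.loads(resp.text)           # resp.json() is the usual shortcut for this
with open('nobels.json','w') as nobels_file:
    json.dump(nobels, nobels_file)       # keep a local copy to explore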

**nobels is a dictionary with a single key, 'prizes'<br>Explore the nobels.json file to get a feel for the data structure**

In [32]:
print(nobels.keys())

dict_keys(['prizes'])


**the value for 'prizes' is a list**

In [33]:
len(nobels['prizes'])

670

**list is of length 670, one item per prize**

In [34]:
# print the first list item: one prize (chemistry) from the latest year, 2023
print(nobels['prizes'][0])

{'year': '2023', 'category': 'chemistry', 'laureates': [{'id': '1029', 'firstname': 'Moungi', 'surname': 'Bawendi', 'motivation': '"for the discovery and synthesis of quantum dots"', 'share': '3'}, {'id': '1030', 'firstname': 'Louis', 'surname': 'Brus', 'motivation': '"for the discovery and synthesis of quantum dots"', 'share': '3'}, {'id': '1031', 'firstname': 'Aleksey', 'surname': 'Yekimov', 'motivation': '"for the discovery and synthesis of quantum dots"', 'share': '3'}]}


**Each list item is a dictionary**

**<font color="brown">Get all prizes awarded in the year 2021</font>**

In [35]:
nobels_2021 = [prize for prize in nobels['prizes'] if prize['year'] == '2021']
print(nobels_2021)

[{'year': '2021', 'category': 'chemistry', 'laureates': [{'id': '1002', 'firstname': 'Benjamin', 'surname': 'List', 'motivation': '"for the development of asymmetric organocatalysis"', 'share': '2'}, {'id': '1003', 'firstname': 'David', 'surname': 'MacMillan', 'motivation': '"for the development of asymmetric organocatalysis"', 'share': '2'}]}, {'year': '2021', 'category': 'economics', 'laureates': [{'id': '1007', 'firstname': 'David', 'surname': 'Card', 'motivation': '"for his empirical contributions to labour economics"', 'share': '2'}, {'id': '1008', 'firstname': 'Joshua', 'surname': 'Angrist', 'motivation': '"for their methodological contributions to the analysis of causal relationships"', 'share': '4'}, {'id': '1009', 'firstname': 'Guido', 'surname': 'Imbens', 'motivation': '"for their methodological contributions to the analysis of causal relationships"', 'share': '4'}]}, {'year': '2021', 'category': 'literature', 'laureates': [{'id': '1004', 'firstname': 'Abdulrazak', 'surname':

**This is TMI and quite hard to read; we want to print it in a user-friendly format**

**We want:**<br>
    Chemistry: name1, name2 ...<br>
    Economics: name1, name2 ...


In [36]:
for prize in nobels_2021:
    print(prize['category'].capitalize() + ': ',end='')
    winners = [winner['firstname']+' '+winner['surname'] for winner in prize['laureates']]
    print(', '.join(winners))

Chemistry: Benjamin List, David MacMillan
Economics: David Card, Joshua Angrist, Guido Imbens
Literature: Abdulrazak Gurnah
Peace: Maria Ressa, Dmitry Muratov
Physics: Syukuro Manabe, Klaus Hasselmann, Giorgio Parisi
Medicine: David Julius, Ardem Patapoutian


**<font color="brown">Get all prizes awarded in the year 2020</font>**

In [37]:
nobels_2020 = [prize for prize in nobels['prizes'] if prize['year'] == '2020']
for prize in nobels_2020:
    print(prize['category'].capitalize() + ': ',end='')
    winners = [winner['firstname']+' '+winner['surname'] for winner in prize['laureates']]
    print(', '.join(winners))

Chemistry: Emmanuelle Charpentier, Jennifer A. Doudna
Economics: Paul Milgrom, Robert Wilson
Literature: Louise Glück
Peace: 

KeyError: 'surname'

**Surname missing in Peace prize, could be missing in other years as well<br>
Use dict get method with default return of empty string if key not found**

In [38]:
for prize in nobels_2020:
    print(prize['category'].capitalize() + ': ',end='')
    winners = [winner['firstname']+' '+winner.get('surname','') for winner in prize['laureates']]
    print(', '.join(winners))

Chemistry: Emmanuelle Charpentier, Jennifer A. Doudna
Economics: Paul Milgrom, Robert Wilson
Literature: Louise Glück
Peace: World Food Programme 
Physics: Roger Penrose, Reinhard Genzel, Andrea Ghez
Medicine: Harvey Alter, Michael Houghton, Charles Rice


---

#### <font color="brown">Basic JSON structure: https://www.json.org/json-en.html</font>
As the description says at the top, JSON is built on two structures, which in Python vernacular are dictionaries (key-value pairs), and lists (arrays).
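
Since everything is nested dictionaries and lists, a short recursive function is enough to outline the shape of any parsed JSON value. This is a sketch with a made-up helper name, `outline`; for lists it looks only at the first item and assumes the rest have the same shape.

```python
import json

def outline(value, depth=0):
    """Return indented lines describing the nesting of a parsed JSON value."""
    pad = '  ' * depth
    lines = []
    if isinstance(value, dict):
        for key, item in value.items():
            lines.append(f'{pad}{key}: {type(item).__name__}')
            lines.extend(outline(item, depth + 1))
    elif isinstance(value, list):
        lines.append(f'{pad}[{len(value)} items]')
        if value:
            lines.extend(outline(value[0], depth + 1))   # first item only
    return lines                                         # scalars add nothing

data = json.loads('{"quiz_scores": [{"name": "Anika", "scores": [38, 40]}]}')
shape = outline(data)
print('\n'.join(shape))
```

The same function can be pointed at a larger structure, such as the `nobels` dictionary loaded earlier, to get a feel for its layout before writing any queries.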