# DX 602 Week 4 Homework



## Introduction

In this homework, you will practice working with strings to represent data, and reading and writing files to access and store data.

You may find it helpful to refer to this GitHub repository of Jupyter notebooks for sample code.

* https://github.com/bu-cds-omds/dx602-examples

Any calculations demonstrated in code examples or videos may be found in these notebooks, and you are allowed to copy this example code in your homework answers.

## Instructions

You should replace every instance of "..." below.
These are where you are expected to write code to answer each problem.

After some of the problems, there are extra code cells that will test functions that you wrote so you can quickly see how they run on an example.
If your code works on these examples, it is more likely to be correct.
However, the autograder will test different examples, so working correctly on these examples does not guarantee full credit for the problem.
You may change the example inputs to further test your functions on your own.
You may also add your own example inputs for problems where we did not provide any.

Be sure to run each code block after you edit it to make sure it runs as expected.
When you are done, we strongly recommend you run all the code from scratch (Runtime menu -> Restart and Run all) to make sure your current code works for all problems.

If your code raises an exception when run from scratch, it will interfere with the autograder and may cause you to lose some or all points for this homework.
Please ask for help in YellowDig or schedule an appointment with a learning facilitator if you get stuck.


#### Shared Imports

These common imports will be useful for some problems.
You may add other imports, but you should not try to install new modules not available in our Codespaces environment.

In [74]:
import csv
import json
import re

### Problem 1

Set the variable `p1` to the length of the string `q1`.

In [75]:
# DO NOT CHANGE

q1 = "Hello, I am a robot living in this notebook."

In [76]:
# YOUR CHANGES HERE

p1 = len(q1)

### Problem 2

The variable `q2` below contains a line of text from a comma-separated value file.
Set `p2` to the number of fields in `q2`.

Hint: Is there a string method that can separate the fields in `q2` into a list?

In [138]:
# DO NOT CHANGE

q2 = "35,Hello,red,153.2,n/a,true,true,true,false,154,92,2024-09-01,2020-03-06,confirmed,n/a,F,100,100\n"

In [139]:
# YOUR CHANGES HERE

# 1. Use the .split(',') method to break the string into a list of fields based on the comma delimiter.
# 2. Use the len() function to count the number of elements in the resulting list.
# 3. Assign the count to the variable p2.
p2 = len(q2.split(','))

In [140]:
p2

18
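One detail of the approach above: `str.split(',')` leaves the trailing newline attached to the last field. That does not change the field count, but it matters if the field values are used later. The following is a minimal sketch on a made-up line (`sample_line` is hypothetical, not homework data), with `csv.reader` shown as one possible alternative.

```python
import csv

# Hypothetical sample line, not taken from the homework data.
sample_line = "7,blue,n/a,true\n"

# Plain split: the newline stays attached to the last field.
fields_split = sample_line.split(",")

# Stripping the line ending first gives clean field values.
fields_clean = sample_line.rstrip("\n").split(",")

# csv.reader accepts any iterable of lines and handles the line ending itself.
fields_csv = next(csv.reader([sample_line]))

print(len(fields_split), repr(fields_split[-1]))
print(fields_clean)
print(fields_csv)
```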

### Problem 3

Write a function `p3` that takes in a string as input, and returns `True` if the string contains "silly" and the length of the string is at most 50 characters and `False` otherwise.

In [80]:
# YOUR CHANGES HERE

def p3(joke):
    # Check if the string "silly" is in the input string 'joke'
    # AND if the length of 'joke' is less than or equal to 50
    if "silly" in joke and len(joke) <= 50:
        return True
    else:
        return False

# A more concise way to write the function:
# def p3(joke):
#     return "silly" in joke and len(joke) <= 50

# Example usage:
# print(p3("This is a silly joke."))  # True
# print(p3("This is a serious joke.")) # False
# print(p3("silly" + "a"*50))        # False (length is 55)
# print(p3("silly" + "a"*40))        # True (length is 45)

In [81]:
p3("this is a silly joke")

True

In [82]:
p3("this joke is so boring because it drones on and on and on and on forever as if noone is really reading this amirite?")

False

### Problem 4

In the video "Parsing Numbers from Strings", you saw the `ord` function used to map individual characters to their character codes.
You can reverse this operation with the `chr` function.

In [83]:
ord('🦄')

129412

In [84]:
chr(129412)

'🦄'

The Unicode characters with codes 128200, 128201, and 128202 are all emoji related to data science.
Set `p4` to the concatenation of these three emoji characters together.
That is, make `p4` with those three characters in that order and just those three characters.

In [85]:
# YOUR CHANGES HERE

# Use chr() to convert each integer code point to a character, 
# and then use the '+' operator to concatenate them.
p4 = chr(128200) + chr(128201) + chr(128202)

In [86]:
p4

'📈📉📊'
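Because the three codes are consecutive, the same string can also be built from a range. This is only a sketch of an alternative; the names `codes`, `charts`, and `round_trip` are made up for illustration.

```python
# Build the same three-character string from a range of code points.
codes = range(128200, 128203)  # 128200, 128201, 128202
charts = "".join(chr(code) for code in codes)

# ord() maps each character back to its code point.
round_trip = [ord(ch) for ch in charts]
print(charts, round_trip)
```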

### Problem 5

Set `p5` to be a copy of the variable `q5` after replacing "jumped over" with "greeted" and "lazy" with "friendly".

In [87]:
# DO NOT CHANGE

q5 = "The quick brown fox jumped over the lazy brown dog."

In [88]:
# YOUR CHANGES HERE

# First, replace "jumped over" with "greeted".
# Second, replace "lazy" with "friendly" on the result of the first replacement.
p5 = q5.replace("jumped over", "greeted").replace("lazy", "friendly")

In [89]:
p5

'The quick brown fox greeted the friendly brown dog.'

### Problem 6

Write a function `p6` that takes in a filename as an argument, reads it as a TSV with a header row, and returns an iterator of dictionaries like in the example code.
Each value should be parsed as an integer.
If a value does not parse successfully, set the value to None.

In [90]:
# YOUR CHANGES HERE

def p6(filename):
    """
    Reads a TSV file with a header row and returns an iterator of dictionaries.
    Each dictionary's value is parsed as an integer; if parsing fails, it is set to None.
    """
    # Open the file for reading. The 'with' statement ensures the file is closed automatically.
    with open(filename, 'r', newline='') as f:
        # Use csv.DictReader, specifying the delimiter as a tab ('\t') for TSV format.
        reader = csv.DictReader(f, delimiter='\t')
        
        # Iterate over each row (which is a dictionary) yielded by DictReader
        for row in reader:
            new_row = {}
            # Iterate over the key-value pairs of the current row (dictionary)
            for key, value in row.items():
                try:
                    # Attempt to convert the value to an integer
                    parsed_value = int(value)
                except (ValueError, TypeError):
                    # If conversion fails (e.g., for 'n/a', or for None when a row is
                    # missing a field), set to None
                    parsed_value = None
                
                # Update the new dictionary with the key and the parsed value
                new_row[key] = parsed_value
            
            # Use 'yield' to return the new dictionary and make the function an iterator (generator)
            yield new_row

In [91]:
list(p6("data6_a.tsv"))

[{'a': 3, 'b': 4}]

In [92]:
list(p6("data6_b.tsv"))

[{'a': None, 'b': None, 'c': None}]
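The parsing rule in this problem can be tried without a file on disk. The sketch below assumes a hypothetical helper `parse_int_or_none` and made-up TSV text read through `io.StringIO`. It also shows that `csv.DictReader` fills in `None` for a row with too few fields, which `int()` rejects with a `TypeError` rather than a `ValueError`.

```python
import csv
import io

def parse_int_or_none(text):
    """Return int(text), or None when text is missing or not an integer."""
    try:
        return int(text)
    except (ValueError, TypeError):
        return None

# Hypothetical TSV content standing in for a file on disk.
# The second data row is deliberately one field short.
sample_tsv = "a\tb\tc\n1\tx\t3\n4\t5\n"

rows = [
    {key: parse_int_or_none(value) for key, value in row.items()}
    for row in csv.DictReader(io.StringIO(sample_tsv), delimiter="\t")
]
print(rows)
```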

### Problem 7

Write a function `p7` that takes in a filename as an argument, reads it as a TSV with a header row, and returns an iterator of dictionaries like in the example code.
Each value should be parsed as an integer, and if the value does not parse successfully, set the value to 3.
In addition, add a new key “finagled” to each dictionary with value True if any value did not parse successfully and False otherwise.

Hints:
1. Use the function `p6` that you previously wrote for the shared work.
2. Use `is None` to check for parsing failures.
3. For the new finagled flag, set it to False initially, and change the value to True if you find a parsing failure.
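
The hints above can be sketched on in-memory rows. The helper name `fill_and_flag` and the sample rows are hypothetical; the rows stand in for the dictionaries that `p6` would yield, where `None` marks a value that failed to parse.

```python
def fill_and_flag(rows, fill_value=3):
    """Replace None values with fill_value and flag rows that needed it."""
    for row in rows:
        finagled = False  # start from False for every row
        new_row = {}
        for key, value in row.items():
            if value is None:  # a value that failed to parse upstream
                value = fill_value
                finagled = True
            new_row[key] = value
        new_row["finagled"] = finagled
        yield new_row

# Made-up rows in the shape produced by a p6-style parser.
parsed_rows = [{"a": 1, "b": None}, {"a": 2, "b": 5}]
result = list(fill_and_flag(parsed_rows))
print(result)
```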


In [93]:
# YOUR CHANGES HERE

def p7(filename):
    """
    Reads a TSV file with a header row, parses values as integers (setting failure to 3), 
    and adds a 'finagled' key to indicate if any parsing failure occurred in the row.
    Returns an iterator of dictionaries.
    """
    # Open the file for reading.
    with open(filename, 'r', newline='') as f:
        # Use csv.DictReader for TSV format.
        reader = csv.DictReader(f, delimiter='\t')
        
        # Iterate over each row (dictionary) from the file
        for row in reader:
            new_row = {}
            finagled_flag = False  # Hint 3: Set 'finagled' to False initially for this row
            
            # Iterate over the key-value pairs of the current row
            for key, value in row.items():
                parsed_value = None
                
                try:
                    # Attempt to convert the value to an integer
                    parsed_value = int(value)
                except (ValueError, TypeError):
                    # If conversion fails:
                    # 1. Set the parsed value to 3 (new requirement)
                    parsed_value = 3
                    # 2. Set the finagled flag to True (new requirement)
                    finagled_flag = True
                
                # Add the key and its parsed/defaulted value to the new dictionary
                new_row[key] = parsed_value
            
            # Add the required 'finagled' key to the dictionary
            new_row['finagled'] = finagled_flag
            
            # Use 'yield' to return the new dictionary and make the function an iterator (generator)
            yield new_row

In [94]:
list(p7("data7_a.tsv"))

[{'a': 3, 'b': 4, 'finagled': False}]

In [95]:
list(p7("data7_b.tsv"))

[{'a': 3, 'b': 3, 'c': 3, 'finagled': True}]

### Problem 8

Write a function `p8` that takes in three inputs: an input filename, an output filename, and a list of column names.
The function should read the input file using the TSV format and write the output file using the TSV format with just the specified input column names.
The output file should have the columns in the same order as the input column name list.


In [96]:
# YOUR CHANGES HERE

def p8(input_filename, output_filename, column_names):
    """
    Reads a TSV file, selects only the specified columns, and writes them 
    to a new TSV file in the order of the 'column_names' list.
    """
    # 1. Read the input TSV file
    with open(input_filename, 'r', newline='', encoding='utf-8') as infile:
        # Use DictReader for the input. It treats the first row as headers 
        # and reads each row as a dictionary. Delimiter is '\t' for TSV.
        reader = csv.DictReader(infile, delimiter='\t')
        
        # 2. Write to the output TSV file
        with open(output_filename, 'w', newline='', encoding='utf-8') as outfile:
            # Use DictWriter for the output. We must specify the fieldnames 
            # (which dictates the header row and column order) and the delimiter.
            writer = csv.DictWriter(
                outfile, 
                fieldnames=column_names, 
                delimiter='\t',
                extrasaction='ignore'  # Ignore keys in the input dicts that aren't in column_names
            )
            
            # Write the header row using the specified column names
            writer.writeheader()
            
            # 3. Iterate through input rows and write them to the output
            # DictWriter automatically handles selecting only the keys that 
            # match the specified 'fieldnames' and puts them in the right order.
            for row in reader:
                writer.writerow(row)

# The extrasaction='ignore' argument tells DictWriter to discard columns from
# the input that are NOT in the 'column_names' list. With the default value
# ('raise'), writing a row that has extra keys raises a ValueError.

You can use the next two cells to test your function.

In [97]:
# test p8
p8("input-8.tsv", "output-8.tsv", ["height", "width", "color"])

In [98]:
try:
    with open("output-8.tsv") as check_fp:
        for line in check_fp:
            print(line.rstrip("\n"))
except FileNotFoundError:
    print("file not found")

height	width	color
45	23	red
62	15	blue
23	123	green
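
The effect of `extrasaction` can be checked in memory. This is a sketch with made-up rows and an `io.StringIO` buffer in place of real files; `lineterminator="\n"` is set only to keep the printed text simple.

```python
import csv
import io

# Hypothetical input rows, as csv.DictReader would produce them.
rows = [
    {"id": "1", "color": "red", "height": "45"},
    {"id": "2", "color": "blue", "height": "62"},
]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["height", "color"],  # output order follows this list
    delimiter="\t",
    extrasaction="ignore",           # silently drop the unused "id" column
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(rows)
output_text = buffer.getvalue()
print(output_text)

# With the default extrasaction="raise", the extra "id" key is an error.
strict = csv.DictWriter(io.StringIO(), fieldnames=["height", "color"])
try:
    strict.writerow(rows[0])
    raised = False
except ValueError:
    raised = True
print(raised)
```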


### Problem 9

Write a function `p9` that takes in a filename as an argument, reads it as a TSV with a header row, and returns the number of rows with data.

Hint:
*  This should be simple, but make sure not to count blank lines.

In [99]:
# YOUR CHANGES HERE

def p9(filename):
    """
    Reads a TSV file with a header row and returns the number of rows with data.
    """
    # Initialize a counter for the data rows
    row_count = 0
    
    # Open the file for reading. newline='' is recommended when using the csv module.
    with open(filename, 'r', newline='') as f:
        # Use DictReader for the TSV format (delimiter='\t').
        # DictReader treats the first line as the header and skips
        # completely empty lines. Lines that contain only spaces, or rows
        # with too few or too many fields, are still yielded as rows.
        reader = csv.DictReader(f, delimiter='\t')
        
        # Iterate over the reader. Each iteration represents one data row.
        for row in reader:
            row_count += 1
            
    return row_count

# What DictReader does here:
# 1. It consumes the header row, so the header is not counted.
# 2. It yields one dictionary per data row.
# 3. It skips completely empty lines between data rows or at the end of
#    the file. Lines containing only whitespace are not skipped.

In [100]:
p9("data9_a.tsv")

1

In [101]:
p9("data9_b.tsv")

0
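
How `csv.DictReader` treats blank lines can be seen on a small in-memory example. The TSV text below is made up, and the stricter count at the end is one possible interpretation of "rows with data", not necessarily what the autograder checks.

```python
import csv
import io

# Hypothetical TSV text: a header, one data row, an empty line,
# a line holding only spaces, and another data row.
sample_tsv = "a\tb\n1\t2\n\n   \n3\t4\n"

rows = list(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"))

# Completely empty lines are skipped, but the spaces-only line is
# still yielded as a row, with the missing field filled in as None.
all_rows = len(rows)

# Counting only rows with at least one non-blank value is stricter.
data_rows = sum(
    1 for row in rows
    if any(value and value.strip() for value in row.values())
)
print(all_rows, data_rows)
```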

### Problem 10

Write a function `p10` that takes in a filename as an argument, and returns `True` if the file is formatted as a TSV file and `False` otherwise.


Hint: You can do this just looking at the first line of the file.

In [102]:
# YOUR CHANGES HERE

def p10(filename):
    """
    Reads the first line of a file and returns True if it appears to be 
    TSV (contains a tab but no comma), and False otherwise.
    """
    try:
        with open(filename, 'r', encoding='utf-8') as f:
            # Read only the first line of the file
            first_line = f.readline().strip()

            # A file is considered a TSV if:
            # 1. It contains at least one tab character ('\t').
            # 2. It does NOT contain a comma (',').
            # This distinguishes it from a common CSV file.
            if '\t' in first_line and ',' not in first_line:
                return True
            else:
                return False
                
    except FileNotFoundError:
        # Handle the case where the file doesn't exist
        return False
    except Exception:
        # Handle other potential file reading errors
        return False

In [103]:
p10("data10_a.tsv")

True

In [104]:
p10("data10_b.tsv")

False

### Problem 11

The variable `p11` below is assigned using an f-string without formatting options.
Modify the f-string to display the number of visits with commas, and the average visit revenue with two digits after the decimal point.
You should only modify the f-string for this problem.

Feel free to search for the formatting options to guide you modifying the f-string.
You will learn the more commonly used options with practice.

In [105]:
# DO NOT CHANGE

q11a = 5125
q11b = 3.5123565123

In [106]:
# YOUR CHANGES HERE

p11 = f"Number of visits = {q11a:,}, average visit revenue = {q11b:.2f}"

In [107]:
p11

'Number of visits = 5,125, average visit revenue = 3.51'
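
The two format options used above can also be combined in a single specifier. A short sketch with made-up numbers (`visits` and `revenue` are hypothetical names):

```python
visits = 1234567
revenue = 9876.54321

with_commas = f"{visits:,}"       # thousands separators
two_decimals = f"{revenue:.2f}"   # fixed-point, two digits after the point
combined = f"{revenue:,.2f}"      # both options together
print(with_commas, two_decimals, combined)
```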

### Problem 12

The variable `q12` below contains a line of text read from a CSV file.
Set `p12` to the floating point number in the first column of `q12`.

In [108]:
# DO NOT CHANGE

q12 = "6.4,dog,red,0.9\n"

In [109]:
# YOUR CHANGES HERE

# 1. Use .split(',') to break the CSV string into a list of columns.
# 2. Access the element at index 0 (the first column), which is the string "6.4".
# 3. Use the float() function to convert the string "6.4" into the floating-point number 6.4.
p12 = float(q12.split(',')[0])

In [110]:
p12

6.4

### Problem 13

Write a function `p13` that takes in a filename as an argument, reads it as a CSV with a header row, and returns a list of the column names.
The list of column names should be in the same order as in the file's header.

In [111]:
# YOUR CHANGES HERE

def p13(filename):
    """
    Reads a CSV file with a header row and returns a list of column names 
    in the order they appear in the file.
    """
    try:
        # Open the file for reading. newline='' is recommended when using the csv module.
        with open(filename, 'r', newline='', encoding='utf-8') as f:
            # Use csv.DictReader. It automatically parses the first row as the header.
            reader = csv.DictReader(f)
            
            # The DictReader's 'fieldnames' attribute contains the column names
            # as a list in header order. It is None for an empty file, so fall
            # back to an empty list in that case.
            return reader.fieldnames or []
            
    except FileNotFoundError:
        # Handle the case where the file doesn't exist
        return []
    except Exception:
        # Handle cases where the file might be empty or improperly formatted
        return []

In [112]:
p13("data13_a.csv")

['foo', 'bar']

In [113]:
p13("data13_b.csv")

['foo', 'bar']
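
Another way to get the header is to read just the first row with `csv.reader`. The sketch below uses a hypothetical helper `header_names` and in-memory text instead of a file, and returns an empty list when there are no rows at all.

```python
import csv
import io

def header_names(lines):
    """Return the first CSV row as a list, or [] if there are no rows."""
    reader = csv.reader(lines)
    return next(reader, [])

with_header = header_names(io.StringIO("foo,bar\n1,2\n"))
empty_input = header_names(io.StringIO(""))
print(with_header, empty_input)
```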

### Problem 14

Write a function `p14` that takes in an input filename and column name, parses the file as a CSV with a header row, and returns a list of the values in the given column.
If the column is missing, your function should raise a KeyError.

In [114]:
# YOUR CHANGES HERE

def p14(filename, column_name):
    """
    Reads a CSV file, extracts and returns a list of all values 
    from the given column_name. Raises KeyError if the column is missing.
    """
    column_values = []
    
    # Open the file for reading.
    with open(filename, 'r', newline='', encoding='utf-8') as f:
        # Use DictReader for CSV. It reads the first row as headers 
        # and returns each data row as a dictionary.
        reader = csv.DictReader(f)
        
        # Raise KeyError up front if the column is not in the header, so the
        # error also occurs for files that have no data rows.
        if column_name not in (reader.fieldnames or []):
            raise KeyError(column_name)

        # Iterate over each row dictionary
        for row in reader:
            # Look up the value using column_name as the dictionary key.
            value = row[column_name]
            
            # Add the extracted value to the list
            column_values.append(value)
            
    return column_values

In [115]:
p14("data14_a.csv", "foo")

['1', '3']

In [116]:
p14("data14_a.csv", "bar")

['2', '4']

In [117]:
p14("data14_b.csv", "foo")

['1']

In [118]:
p14("data14_b.csv", "bar")

['2']

### Problem 15

Write a function `p15` that takes in a filename and string key, parses the file as JSON, and returns the value for that key. If the object in the JSON file is not a dictionary or the given key does not exist, then the function should return None.

In [119]:
# YOUR CHANGES HERE

def p15(filename, key):
    """
    Reads a file, parses it as JSON, and returns the value for the given key.
    Returns None if the JSON object isn't a dictionary or if the key is missing.
    """
    try:
        # 1. Open and read the JSON file
        with open(filename, 'r', encoding='utf-8') as f:
            data = json.load(f)
            
    except FileNotFoundError:
        # Handle non-existent files
        return None
    except json.JSONDecodeError:
        # Handle cases where the file content is not valid JSON
        return None
    except Exception:
        # Handle other potential file reading errors
        return None

    # 2. Check if the parsed object is a dictionary
    if not isinstance(data, dict):
        return None

    # 3. Safely retrieve the value for the key
    # Using .get() is the safest way to check for a key and return None if it's missing.
    # The default return value of .get() is None if the key is not found.
    return data.get(key)

In [120]:
p15("data15_a.json", "x")

3

In [121]:
p15("data15_a.json", "y")
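
The three cases in this problem (key present, key missing, top-level value that is not a dictionary) can be exercised on JSON strings directly. The helper `lookup` is hypothetical and uses `json.loads` on text rather than `json.load` on a file.

```python
import json

def lookup(json_text, key):
    """Return the value for key, or None if the JSON is not an object."""
    data = json.loads(json_text)
    if not isinstance(data, dict):
        return None
    return data.get(key)

present = lookup('{"x": 3}', "x")     # key exists
missing = lookup('{"x": 3}', "y")     # key does not exist
not_dict = lookup('[1, 2, 3]', "x")   # top-level list, not a dictionary
print(present, missing, not_dict)
```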

### Problem 16

Write a function `p16` that takes in an input filename, parses the file as a TSV with a header row, and returns a dictionary with the average value of each column.

Hint:
* Write a helper function to compute the average of a list, and use list comprehensions to get all the values for each column.


In [122]:
# YOUR CHANGES HERE

def _calculate_average(data_list):
    """Helper function to compute the average of a list of numbers."""
    if not data_list:
        return 0.0  # Return 0.0 or handle as needed for an empty column
    return sum(data_list) / len(data_list)

def p16(filename):
    """
    Reads a TSV file with a header, computes the average of each column,
    and returns a dictionary of column averages. Non-numeric values are skipped.
    """
    column_data = {}  # Dictionary to store a list of all numeric values per column
    
    try:
        with open(filename, 'r', newline='', encoding='utf-8') as f:
            # Use DictReader for TSV (Tab-Separated Values)
            reader = csv.DictReader(f, delimiter='\t')
            
            # Initialize column_data with empty lists using fieldnames from the header
            if reader.fieldnames:
                for header in reader.fieldnames:
                    column_data[header] = []
            
            # 1. Iterate through rows and collect numeric data
            for row in reader:
                for column_name, value in row.items():
                    try:
                        # Attempt to convert the value to a float
                        numeric_value = float(value)
                        # Append only numeric values
                        column_data[column_name].append(numeric_value)
                    except ValueError:
                        # Skip values that cannot be parsed as a float
                        pass
            
    except FileNotFoundError:
        # Return an empty dictionary if the file is not found
        return {}
    
    # 2. Compute the average for each column using a dictionary comprehension
    # The helper function is used here to compute the final average
    average_by_column = {
        column_name: _calculate_average(values_list)
        for column_name, values_list in column_data.items()
    }
    
    return average_by_column

In [123]:
p16("data16_a.tsv")

{'foo': 1.0, 'bar': 2.0}
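
The hint's approach (a helper for the average plus list comprehensions per column) can be sketched as follows. The TSV text is made up, and the sketch assumes every value is numeric and every column has at least one row.

```python
import csv
import io

def average(values):
    """Mean of a non-empty list of numbers."""
    return sum(values) / len(values)

# Hypothetical TSV content; every value is assumed to be numeric.
sample_tsv = "foo\tbar\n1\t2\n3\t6\n"

reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t")
rows = list(reader)

# One list comprehension per column collects that column's values.
averages = {
    name: average([float(row[name]) for row in rows])
    for name in reader.fieldnames
}
print(averages)
```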

### Problem 17

Write a function `p17` that takes in a filename as an argument and a list of column names, parses the file as a CSV, and returns True if all the given columns are in the file and False otherwise.

In [124]:
# YOUR CHANGES HERE

def p17(filename, column_names):
    """
    Checks if all given column_names exist in the header of the CSV file.
    Returns True if all columns are present, False otherwise.
    """
    try:
        # 1. Read the file's header (fieldnames)
        with open(filename, 'r', newline='', encoding='utf-8') as f:
            reader = csv.DictReader(f)
            file_headers = reader.fieldnames
            
            # If the file is empty or has no header, return False immediately
            if not file_headers:
                return False

        # 2. Use set operations for efficient checking
        
        # Convert the file headers and the required column names to sets
        required_set = set(column_names)
        actual_set = set(file_headers)

        # Check if the set of required columns is a subset of the actual headers.
        # This is True only if ALL required columns are present in the file.
        return required_set.issubset(actual_set)

    except FileNotFoundError:
        # If the file doesn't exist, the columns cannot be present
        return False
    except Exception:
        # Handle other reading errors (e.g., file permissions, empty file contents)
        return False

In [125]:
p17("data17_a.csv", ["foo"])

True

In [126]:
p17("data17_a.csv", ["foo", "bar"])

True

In [127]:
p17("data17_a.csv", ["baz"])

False
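
The subset check at the core of this problem works on plain lists too. In the sketch below, `has_all_columns` and the sample header are hypothetical; note that an empty list of required columns counts as satisfied.

```python
def has_all_columns(header, required):
    """True if every name in required appears in header."""
    return set(required).issubset(header)

header = ["foo", "bar"]
one_present = has_all_columns(header, ["foo"])
one_missing = has_all_columns(header, ["foo", "baz"])
none_required = has_all_columns(header, [])
print(one_present, one_missing, none_required)
```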

### Problem 18

Write a function `p18` that takes in a filename, column name, and column value as arguments, parses the file as a TSV, and returns the first row where the given column has the given value. The row should be returned as a dictionary. If no such row exists, return None.


In [128]:
# YOUR CHANGES HERE

def p18(filename, column_name, column_value):
    """
    Reads a TSV file and returns the first row (as a dictionary) 
    where the specified column has the given value. Returns None if no row is found.
    """
    try:
        # Open the file for reading (TSV)
        with open(filename, 'r', newline='', encoding='utf-8') as f:
            # Use DictReader for TSV, which returns each row as a dictionary.
            reader = csv.DictReader(f, delimiter='\t')
            
            # Iterate through each row dictionary yielded by the reader
            for row in reader:
                # Check if the column exists in the row AND its value matches the target value
                # We use .get() to safely check for the column_name, although 
                # DictReader should ensure all rows have keys from the header.
                if row.get(column_name) == column_value:
                    # Found the first match, return the entire row dictionary immediately
                    return row
                    
        # If the loop finishes without finding a match, return None
        return None
        
    except FileNotFoundError:
        # Handle case where the input file doesn't exist
        return None
    except KeyError:
        # Defensive only: row.get() above returns None for a missing column
        # instead of raising KeyError, so this branch is not expected to run.
        return None
    except Exception:
        # Handle other general reading/parsing errors
        return None

In [129]:
p18("data18_a.tsv", "foo", 3)

In [130]:
p18("data18_a.tsv", "bar", 4)
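
One thing to keep in mind when testing with integer arguments, as in the cells above: `csv.DictReader` yields every value as a string, and a string never equals an integer in Python. Whether that explains the empty outputs depends on the contents of `data18_a.tsv`, which are not shown here. The sketch below uses made-up TSV text.

```python
import csv
import io

# Hypothetical TSV content, not the homework's data file.
sample_tsv = "foo\tbar\n3\t4\n"
row = next(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"))

matches_int = row["foo"] == 3         # str compared with int
matches_str = row["foo"] == "3"       # str compared with str
matches_cast = row["foo"] == str(3)   # convert the target before comparing
print(matches_int, matches_str, matches_cast)
```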

### Problem 19

The following function `p19` is supposed to check if its input list has at least 10 entries, and return the 10th entry if it exists, and None if it has fewer than 10 entries.

However, there is a bug in the code so it often returns wrong answers and sometimes crashes.
Fix the bug in `p19`.

In [131]:
# YOUR CHANGES HERE

def p19(input):
    # The list must have a length of at least 10 to contain the 10th entry (index 9).
    if len(input) >= 10:
        # The 10th entry is at index 9, not 10.
        return input[9]
    return None

In [132]:
# this should return "a"

p19("aaaaaaaaaaaaaa")

'a'

In [133]:
# this should return None
p19("bbbbbbb")

In [134]:
# this should return "j"

p19("abcdefghij")

'j'
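
The off-by-one reasoning generalizes: the n-th entry, counting from 1, lives at index n - 1. A sketch with a hypothetical helper `nth_entry`, assuming `n` is at least 1:

```python
def nth_entry(items, n):
    """Return the n-th entry (counting from 1), or None if it is missing.

    Assumes n >= 1.
    """
    if len(items) >= n:
        return items[n - 1]
    return None

tenth = nth_entry("abcdefghij", 10)
too_short = nth_entry("bbbbbbb", 10)
print(tenth, too_short)
```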

### Problem 20

Set `p20` to be a list of filenames of the form "data20_X.tsv" that exist in the current directory.
X should be a one digit number from 0 to 9.
For example, the filename could be "data20_0.tsv", "data20_9.tsv", or any other filename using 0 to 9 to set X.



There are many ways to do this.
It can be done using just this week's lessons.
You may also find easier ways to check based on libraries.
You may use libraries to solve this problem, as long as they are installed by default in our Codespaces environment.
(If you try to install other libraries, your answer will likely be rejected by the auto-grader.)

In [135]:
# YOUR CHANGES HERE

import os
import re

# Define the regular expression pattern:
# ^data20_    -> Starts with "data20_"
# [0-9]       -> Followed by exactly one digit (0 to 9)
# \.tsv$      -> Ends with ".tsv" (the dot must be escaped with \)
pattern = re.compile(r"^data20_[0-9]\.tsv$")

# 1. List all files and directories in the current directory
all_files = os.listdir('.')

# 2. Use a list comprehension to filter the files that match the pattern
p20 = [
    filename 
    for filename in all_files 
    if pattern.match(filename)
]

In [136]:
p20

['data20_0.tsv', 'data20_8.tsv']
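
Since there are only ten possible names, another option is to build each candidate and check whether it exists, with no regular expression. This is a sketch of that alternative; the function name `existing_data20_files` and its `directory` parameter are made up for illustration. The result comes back in order of X, whereas `os.listdir` returns names in arbitrary order.

```python
import os

def existing_data20_files(directory="."):
    """List the data20_X.tsv names (X from 0 to 9) present in directory."""
    candidates = [f"data20_{digit}.tsv" for digit in range(10)]
    return [
        name for name in candidates
        if os.path.isfile(os.path.join(directory, name))
    ]

print(existing_data20_files())
```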