

# Module 1 Lab - Exploring and Analyzing Missing Files in a Project
You are working on a new project in which you need to organize a number of report files in `.docx` format. Each `.docx` file should have a `.pdf` version (with the same name) as well as a supplementary `.csv` file, but some `.csv` and `.pdf` files are missing. Your task is to identify the missing files and create a report of the names and counts of the missing `.pdf` and `.csv` files.

## Learning Objectives
- Organize a large number of report files.
- Identify missing files and create a report containing the names and counts of missing `.pdf` and `.csv` files.
- Process files using loops, conditionals, and string methods.
- Use functions to avoid repetition and make code easy to maintain.

## Dataset Description
In order to complete this exercise you will need to access a directory named **`report_files/`** which contains:
- **A number of `.docx` files** (`report_<number>.docx`). These are always present.
- **`.pdf` files** (`report_<number>.pdf`). Some of these are missing.
- **`.csv` files** (`Supplementary_data_R<number>.csv`). Some of these are missing.

<br>

> **NOTE:**
<br>To access the `report_files` folder, **you must** run the cell below. It fetches the files from GitHub and stores the directory in your (temporary) files, which you can access by clicking the Files icon on the left ribbon.


In [None]:
%%bash
set -euo pipefail

REPO_OWNER="UCB-urban-data101"
REPO_NAME="Module1_lab"

ZIP_URL_MAIN="https://github.com/${REPO_OWNER}/${REPO_NAME}/archive/refs/heads/main.zip"
ZIP_URL_MASTER="https://github.com/${REPO_OWNER}/${REPO_NAME}/archive/refs/heads/master.zip"

TMP_DIR="$(mktemp -d)"
ZIP_PATH="${TMP_DIR}/repo.zip"

# Fetch zip (wget; fallback to curl)
if command -v wget >/dev/null 2>&1; then
  wget -q -O "$ZIP_PATH" "$ZIP_URL_MAIN" || wget -q -O "$ZIP_PATH" "$ZIP_URL_MASTER"
else
  curl --fail -L -s -o "$ZIP_PATH" "$ZIP_URL_MAIN" || curl --fail -L -s -o "$ZIP_PATH" "$ZIP_URL_MASTER"
fi


unzip -oq "$ZIP_PATH" -d "$TMP_DIR"


SRC_DIR_MAIN="${TMP_DIR}/${REPO_NAME}-main"
SRC_DIR_MASTER="${TMP_DIR}/${REPO_NAME}-master"
if [ -d "${SRC_DIR_MAIN}/report_files" ]; then
  SRC_DIR="${SRC_DIR_MAIN}"
elif [ -d "${SRC_DIR_MASTER}/report_files" ]; then
  SRC_DIR="${SRC_DIR_MASTER}"
else
  echo "Could not locate report_files after unzip. Contents were:"
  find "$TMP_DIR" -maxdepth 2 -type d -print
  exit 1
fi

# Copy only the desired folder into CWD
rm -rf report_files
cp -R "${SRC_DIR}/report_files" .


chmod -R u+rwX,go+rX report_files
echo "Fetch complete. Top-level entries in ./report_files:"
ls -al "report_files" | head -n 20 || true


python - <<'PY'
import os, glob
root = "report_files"
assert os.path.isdir(root), "report_files directory not found"
csvs = sorted(glob.glob(os.path.join(root, "*.csv")))
if csvs:
    p = csvs[0]
    print(f"\n[Python] Reading first 5 lines of: {os.path.basename(p)}")
    with open(p, "r", encoding="utf-8", errors="replace") as f:
        for i, line in zip(range(5), f):
            print(line.rstrip("\n"))
else:
    print("\n[Python] No CSV detected to preview.")
PY


rm -rf "$TMP_DIR"



Fetch complete. Top-level entries in ./report_files:
total 5696
drwxr-xr-x 2 root root 20480 Sep 17 17:17 .
drwxr-xr-x 1 root root  4096 Sep 17 17:17 ..
-rw-r--r-- 1 root root 36595 Sep 17 17:17 report_0.docx
-rw-r--r-- 1 root root    24 Sep 17 17:17 report_0.pdf
-rw-r--r-- 1 root root 36596 Sep 17 17:17 report_100.docx
-rw-r--r-- 1 root root    36 Sep 17 17:17 report_100.pdf
-rw-r--r-- 1 root root 36596 Sep 17 17:17 report_101.docx
-rw-r--r-- 1 root root    26 Sep 17 17:17 report_101.pdf
-rw-r--r-- 1 root root 36596 Sep 17 17:17 report_102.docx
-rw-r--r-- 1 root root 36596 Sep 17 17:17 report_103.docx
-rw-r--r-- 1 root root    26 Sep 17 17:17 report_103.pdf
-rw-r--r-- 1 root root 36596 Sep 17 17:17 report_104.docx
-rw-r--r-- 1 root root 36597 Sep 17 17:17 report_105.docx
-rw-r--r-- 1 root root    26 Sep 17 17:17 report_105.pdf
-rw-r--r-- 1 root root 36596 Sep 17 17:17 report_106.docx
-rw-r--r-- 1 root root    36 Sep 17 17:17 report_106.pdf
-rw-r--r-- 1 root root 36597 Sep 17 17:17 rep

## Part I. Explore the Project Files

### List All Files in the `report_files/` Directory
- Import the `os` library
- Call `os.listdir()` and pass the directory path (`"report_files/"`) as an argument to get the list of files. (Passing an argument means placing it inside the parentheses.)

**Example:**
  ```
  import os

  # List all files in the directory
  myDocuments = os.listdir("my_documents")
  ```



### **❗️ TRY IT OUT ❗️**
Using the code cell below and the example code block above, list the contents of the `report_files` directory AND print the first 10 files it contains. When you are done writing your code, be sure to run it.

> **Hint:** <br>
Use indexing to select the first 10 files to print.  To do this, use brackets after the variable name where you are storing your newly-created list.  
**Example:** `myDocuments[:10]`
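
To make the indexing hint concrete, here is a minimal sketch using a made-up list. The name `sample_names` and its contents are illustrative assumptions, not part of the lab data.

```python
# Hypothetical list used only to illustrate slicing; not the lab's data.
sample_names = ["a.docx", "b.pdf", "c.csv", "d.docx", "e.pdf"]

# [:3] selects the items at positions 0, 1, and 2.
first_three = sample_names[:3]
print(first_three)

# Slicing past the end of a list does not raise an error;
# it returns as many items as the list actually has.
first_ten = sample_names[:10]
print(first_ten)
```

Note that `os.listdir()` returns names in arbitrary order, so the first 10 files you see may differ from someone else's. Wrapping the call in `sorted()` is one way to get a predictable order.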

In [None]:
# import the os library
import os

# YOUR CODE HERE:
#1. Create new variable called "files"
files = os.listdir("report_files")

#2. Identify the first 10 values in files using indexing and print the output.
print(files[:10])


['Supplementary_data_R24.csv', 'report_89.pdf', 'Supplementary_data_R31.csv', 'report_128.docx', 'report_116.docx', 'Supplementary_data_R122.csv', 'Supplementary_data_R37.csv', 'report_97.docx', 'report_7.docx', 'report_75.pdf']


---

## Part II: Process Files Using Loops, Conditionals, and String Methods
Now that we have progressed to Part II, we will have to complete the following tasks:
- Use loops to iterate over the list of files.
- Apply conditionals to filter files based on their extensions.
- Use string methods to find files with specific patterns or extract certain parts of filenames.

### Step 1: Categorize Files by Type *(Decomposition)*

To determine which files are present and which are missing in such a large directory, it helps to summarize its contents. In this exercise you'll apply what you learned in Part I and use loops to create a separate list for each file type. You'll also count the number of `.pdf`, `.csv`, and `.docx` files in your `report_files` directory.

### Step 2: Create an empty list to hold all the files of a specific type
Creating empty lists allows you to collect and accumulate data as your program runs. Empty lists can be helpful when filtering results, storing user input, building datasets, or gathering items that meet certain conditions.

**Example:**
Let's say you're analyzing a directory containing photos and need to find all image files to create a backup or generate thumbnails. By creating an empty list first, you can:

- List all `.jpg`, `.png`, and `.gif` files in various folders
- Filter out corrupted or very small images
- Sort them by file size or creation date
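
As a small sketch of the "start empty, then accumulate" idea, the snippet below builds up a list one item at a time. The name `image_files` and the file names are made up for illustration.

```python
# Start with an empty list: it holds nothing yet.
image_files = []
count_before = len(image_files)
print(count_before)

# append() adds one item to the end of the list each time it is called.
image_files.append("beach.jpg")
image_files.append("logo.png")

print(image_files)
print(len(image_files))
```

In the next step, the `append()` calls move inside a loop so that items are added automatically rather than by hand.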

### **❗️ TRY IT OUT ❗️**
Run the cell below to create an empty list for docx file types.




In [None]:
docx_files = []

### Step 3: Set a condition and loop through the list of files
**Loops** allow you to repeat a block of code multiple times without having to write the same code over and over again.

> **Example:** <br>
  ```
  for fruit in ["apple", "banana", "orange"]:
      print(fruit)
  ```
**IF** statements allow your code to execute different actions based on whether certain conditions are true or false. In this step, you will create a for loop and apply an if statement to iterate through every file of a particular type in a folder to retrieve its name.

> **Example:** <br>
  ```
  temperature = 85  # example value so the snippet runs on its own
  if temperature > 80:
      print("It's hot outside!")
  ```
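
Loops and `if` statements are often combined. The sketch below uses a made-up list of temperature readings (`temperatures` and `hot_days` are illustrative names, not part of the lab) and keeps only the values that pass a condition:

```python
# Hypothetical temperature readings, used only for illustration.
temperatures = [72, 85, 90, 68]

hot_days = []  # empty list that will collect the matching values

for temperature in temperatures:
    # The indented block runs only when the condition is True.
    if temperature > 80:
        hot_days.append(temperature)

print(hot_days)
```

The same three ingredients (an empty list, a `for` loop, and an `if` condition) are what you need for the file-filtering task below.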


### **❗️ TRY IT OUT ❗️**
Edit the code block below to create a for loop that iterates through every file in your list of `files`.  Add a condition to search for a particular file extension, ".docx". Store the result in your new variable called "`docx_files`".

In [None]:
# YOUR CODE HERE:
# 1. Create your for loop
for file in files:

  # 2. Set your condition for the iterator.
  # Hint: use file.endswith(".docx") to set your condition.
  if file.endswith(".docx"):

      # 3. Add each docx file name to your newly-created docx_files list.
      # Hint: Use the append method to add each file name to docx files.
      # The syntax is as follows: docx_files.append()
      docx_files.append(file)

print(docx_files[:10])

['report_128.docx', 'report_116.docx', 'report_97.docx', 'report_7.docx', 'report_49.docx', 'report_93.docx', 'report_99.docx', 'report_65.docx', 'report_69.docx', 'report_94.docx']


### Step 4: Create separate lists for `.pdf` and `.csv` files

Now that we have created a list containing the names of all the `.docx` files in the directory, we need to do the same for the other file types in the directory, namely the `.pdf` and `.csv` files.

### **❗️ TRY IT OUT ❗️**
Follow the same steps we used previously and write code to create separate lists for `.pdf` and `.csv` files.

In [None]:
# Getting you started by creating an empty list to hold the names of the pdf files
pdf_files = []

# YOUR CODE HERE
for file in files:
  if file.endswith(".pdf"):
    pdf_files.append(file)

# Getting you started by creating an empty list to hold the names of the csv files
csv_files = []

# YOUR CODE HERE
for file in files:
  if file.endswith(".csv"):
    csv_files.append(file)



### Step 5: Run the code cell below to print the results.

In [None]:
print(docx_files, "\n", pdf_files, "\n", csv_files)


['report_128.docx', 'report_116.docx', 'report_97.docx', 'report_7.docx', 'report_49.docx', 'report_93.docx', 'report_99.docx', 'report_65.docx', 'report_69.docx', 'report_94.docx', 'report_66.docx', 'report_17.docx', 'report_107.docx', 'report_92.docx', 'report_46.docx', 'report_63.docx', 'report_9.docx', 'report_108.docx', 'report_18.docx', 'report_41.docx', 'report_119.docx', 'report_53.docx', 'report_6.docx', 'report_14.docx', 'report_109.docx', 'report_105.docx', 'report_47.docx', 'report_130.docx', 'report_122.docx', 'report_36.docx', 'report_91.docx', 'report_129.docx', 'report_12.docx', 'report_77.docx', 'report_72.docx', 'report_22.docx', 'report_75.docx', 'report_81.docx', 'report_133.docx', 'report_67.docx', 'report_90.docx', 'report_95.docx', 'report_96.docx', 'report_98.docx', 'report_20.docx', 'report_57.docx', 'report_2.docx', 'report_85.docx', 'report_58.docx', 'report_127.docx', 'report_37.docx', 'report_27.docx', 'report_13.docx', 'report_89.docx', 'report_78.docx', '

---
## Part III: Pattern Recognition (Computational Thinking)
What was the block of code that you used more than once in the previous step?

Notice the recurring steps:
- Iterate over a list of files.
- Check if each file has a specific extension.
- Add the files with specific extension to a list.

If you have 10 different file extensions to count, how many times do you need to repeat that block of code?

Rather than writing separate code for each file type, define a function that takes the list of all files and
the file extension and returns the list of files with that extension. This approach makes the process reusable.

> **Example:**
  ```
  def say_hello():
      print("Hello, World!")
  ```

> **Note:**<br>
 We could also create these lists using an `if/elif` statement, but to practice writing functions and improve modularity, we will define a function instead.
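
The `say_hello` example takes no inputs and returns nothing. As a bridge to the lab's function, here is a minimal sketch of a function that accepts two arguments and returns a list. The function name `keep_longer_than` and the sample words are hypothetical and chosen only for illustration.

```python
def keep_longer_than(words, min_length):
    """Return the words that have more than min_length characters."""
    kept = []
    for word in words:
        if len(word) > min_length:
            kept.append(word)
    return kept  # hand the finished list back to the caller


# Call the function and store what it returns in a variable.
long_words = keep_longer_than(["cat", "giraffe", "ox", "zebra"], 3)
print(long_words)
```

Because the list and the threshold are parameters, the same function can be reused with different inputs instead of copying the loop.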

### **❗️ TRY IT OUT ❗️**
Using functions helps avoid repetition and makes the code easier to modify if new file types need to be checked. Review the function below. Based on the function syntax, call the function and store the results for docx, pdf, and csv files in the provided variables.

In [None]:
def get_files_by_extension(files_lst: list[str], extension: str):
    """Returns a list of files that match the given extension."""
    f_ext_list = []
    for file in files_lst:
        if file.endswith(extension):
            f_ext_list.append(file)
    return f_ext_list

# List of all files in the directory
files = os.listdir("report_files/")

# Get the list of files by extension
docx_files_func =   get_files_by_extension(files, ".docx")
pdf_files_func  =   get_files_by_extension(files, ".pdf")
csv_files_func  =   get_files_by_extension(files, ".csv")

# Print the counts
print("Number of .docx files:", len(docx_files_func))
print("Number of .pdf files:", len(pdf_files_func))
print("Number of .csv files:", len(csv_files_func))

Number of .docx files: 135
Number of .pdf files: 100
Number of .csv files: 103


## Part IV: Identify Missing `.pdf` and `.csv` Files
Now that we have lists of `.docx`, `.pdf`, and `.csv` files, let's check which reports are missing their required counterparts.  In this part, we will extract report numbers from the `docx` files, check if corresponding `pdf` files exist, check if corresponding `csv` files exist, then count and report the missing files.



### Step 1: Breaking Down the Problem:
Before we can check for missing files, we need to extract the report numbers from our `.docx` filenames.  We do that using string splitting.

**How string splitting works:**
- `"report_5.docx".split("_")` returns `["report", "5.docx"]`
- We take the second part (`"5.docx"`) and remove `.docx` to get just the number

Let's extract all report numbers:
- Every `report_<number>.docx` should have:
  - A corresponding `report_<number>.pdf`.
  - A corresponding `Supplementary_data_R<number>.csv`.
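
The splitting steps described above can be tried on single file names first. This is a small sketch with hard-coded example names; the variable names are illustrative.

```python
name = "report_5.docx"

# Split at the underscore: one underscore gives exactly two parts.
parts = name.split("_")
print(parts)

# Remove the extension from the second part to leave only the number.
number = parts[1].replace(".docx", "")
print(number)

# A CSV name contains two underscores, so it splits into three parts.
csv_parts = "Supplementary_data_R5.csv".split("_")
print(csv_parts)
```

Two things are worth noticing: the extracted number is a string (`"5"`), not an integer, and a name with a different number of underscores produces a different number of parts, which is why the lab code checks `len(parts) == 2` before using the result.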

### **❗️ TRY IT OUT ❗️**
Run the code below to view the output of the string splitting for the docx files.

In [None]:
# Create an empty list to store report numbers
docx_report_numbers = []

# Extract report numbers from the .docx filenames
for file in docx_files_func:
    # Split the filename at the underscore
    parts = file.split("_")

    # Check if the filename matches our expected pattern
    if len(parts) == 2 and parts[0] == "report":
        # Remove .docx extension to get just the number
        report_number = parts[1].replace(".docx", "")
        docx_report_numbers.append(report_number)
    else:
        print(f'{file} does not match the pattern!')

print(f"Found {len(docx_report_numbers)} report numbers:")
print(docx_report_numbers)

Found 135 report numbers:
['128', '116', '97', '7', '49', '93', '99', '65', '69', '94', '66', '17', '107', '92', '46', '63', '9', '108', '18', '41', '119', '53', '6', '14', '109', '105', '47', '130', '122', '36', '91', '129', '12', '77', '72', '22', '75', '81', '133', '67', '90', '95', '96', '98', '20', '57', '2', '85', '58', '127', '37', '27', '13', '89', '78', '84', '86', '59', '131', '35', '115', '83', '56', '8', '70', '16', '50', '28', '31', '0', '80', '44', '68', '120', '15', '25', '124', '40', '82', '54', '64', '42', '76', '1', '121', '24', '117', '104', '3', '62', '19', '29', '4', '26', '88', '61', '51', '71', '73', '5', '118', '55', '39', '101', '45', '30', '74', '111', '106', '125', '43', '33', '52', '132', '100', '11', '21', '113', '103', '10', '60', '112', '123', '87', '126', '23', '114', '110', '38', '79', '134', '34', '48', '102', '32']


### Step 2: Count and Report Missing Files

Now we'll check if each report has its corresponding PDF file. We'll:
1. Create the expected PDF filename using the report number
2. Check if that filename exists in our PDF files list
3. Keep track of any missing files

### **❗️ TRY IT OUT ❗️**
Run the code below to identify missing PDF files.

In [None]:
# Initialize counters and storage for missing files
num_missing_pdfs = 0
missing_pdfs = []

# Check each report number for corresponding PDF
for number in docx_report_numbers:
    # Create the expected PDF filename
    pdf_name = f"report_{number}.pdf"

    # Check if this PDF exists in our pdf_files list
    if pdf_name not in pdf_files:
        missing_pdfs.append(pdf_name)
        num_missing_pdfs += 1

print(f"Number of missing .pdf files: {num_missing_pdfs}")
if missing_pdfs:
    print("Missing PDF files:")
    for pdf in missing_pdfs:
        print(f"  - {pdf}")

Number of missing .pdf files: 35
Missing PDF files:
  - report_7.pdf
  - report_49.pdf
  - report_66.pdf
  - report_92.pdf
  - report_9.pdf
  - report_129.pdf
  - report_77.pdf
  - report_22.pdf
  - report_67.pdf
  - report_90.pdf
  - report_20.pdf
  - report_57.pdf
  - report_85.pdf
  - report_37.pdf
  - report_84.pdf
  - report_131.pdf
  - report_115.pdf
  - report_83.pdf
  - report_56.pdf
  - report_8.pdf
  - report_70.pdf
  - report_15.pdf
  - report_82.pdf
  - report_76.pdf
  - report_104.pdf
  - report_3.pdf
  - report_29.pdf
  - report_118.pdf
  - report_55.pdf
  - report_74.pdf
  - report_21.pdf
  - report_60.pdf
  - report_112.pdf
  - report_126.pdf
  - report_102.pdf


### Step 3: Check for Missing CSV Files

Now let's do the same check for CSV files. Remember that CSV files follow the pattern: `Supplementary_data_R<number>.csv`

### **❗️ TRY IT OUT ❗️**
Edit the code cell below to find missing CSV files.  To do this, refer to the comments for hints and follow the same logic we applied to the PDFs.

In [None]:
# Initialize counters and storage for missing CSV files
num_missing_csvs = 0
missing_csvs = []

# TODO: Complete this code following the PDF example above
for number in docx_report_numbers:
    # Create the expected CSV filename (hint: use the pattern Supplementary_data_R<number>.csv)
    csv_name = f"Supplementary_data_R{number}.csv"

    # Check if this CSV exists in our csv_files list
    if csv_name not in csv_files:
        # Add to missing list and increment counter
        missing_csvs.append(csv_name)
        num_missing_csvs += 1

print(f"Number of missing .csv files: {num_missing_csvs}")
if missing_csvs:
    print("Missing CSV files:")
    for csv_file in missing_csvs:
        print(f"  - {csv_file}")

Number of missing .csv files: 32
Missing CSV files:
  - Supplementary_data_R116.csv
  - Supplementary_data_R97.csv
  - Supplementary_data_R7.csv
  - Supplementary_data_R9.csv
  - Supplementary_data_R6.csv
  - Supplementary_data_R36.csv
  - Supplementary_data_R12.csv
  - Supplementary_data_R72.csv
  - Supplementary_data_R67.csv
  - Supplementary_data_R90.csv
  - Supplementary_data_R57.csv
  - Supplementary_data_R85.csv
  - Supplementary_data_R13.csv
  - Supplementary_data_R89.csv
  - Supplementary_data_R70.csv
  - Supplementary_data_R50.csv
  - Supplementary_data_R28.csv
  - Supplementary_data_R15.csv
  - Supplementary_data_R64.csv
  - Supplementary_data_R76.csv
  - Supplementary_data_R117.csv
  - Supplementary_data_R19.csv
  - Supplementary_data_R29.csv
  - Supplementary_data_R118.csv
  - Supplementary_data_R55.csv
  - Supplementary_data_R33.csv
  - Supplementary_data_R52.csv
  - Supplementary_data_R60.csv
  - Supplementary_data_R23.csv
  - Supplementary_data_R114.csv
  - Supplementary

---

## Part V. Write Missing File Reports to CSVs
Now that we have identified the missing `.pdf` and `.csv` files, we need to save the results in a structured report.
Instead of saving the report directly in the main directory, we will **create a new folder** to store the report.

### Step 1: Create a Folder to Store Reports
We will start this process by creating a folder to store the report file which will...
- Keep the reports **organized** and separate from the dataset files.
- Make it easier to **manage multiple reports** over time.
- Avoid clutter in the **main directory**.

>  **How to Create a Folder in Python** <br>
In Python, you can create a new folder using the `os.makedirs()` function.
Calling it before writing files ensures that the folder exists.
When called with `exist_ok=True`, it does **not** raise an error if the folder already exists.

> **How `os.makedirs()` Works:**
- If `missing_reports/` does **not** exist, it **creates** the folder.
- If `missing_reports/` folder **already exists**, the function does **nothing** because of the `exist_ok=True` argument.
- Unlike `os.mkdir()`, which only creates a single-level directory, `os.makedirs()` can also create **nested directories** if needed.

This ensures that when we later save the CSV report, the folder is available and ready.
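
The behavior described above can be checked safely in a throwaway location. The sketch below uses `tempfile.TemporaryDirectory()` so nothing is left behind; the folder names `reports` and `2024` are made up for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as scratch:
    # A nested path: neither "reports" nor "2024" exists yet.
    nested = os.path.join(scratch, "reports", "2024")

    os.makedirs(nested, exist_ok=True)  # creates both levels at once
    os.makedirs(nested, exist_ok=True)  # second call is a harmless no-op
    created = os.path.isdir(nested)

    # Without exist_ok=True, an existing folder raises FileExistsError.
    try:
        os.makedirs(nested)
        raised = False
    except FileExistsError:
        raised = True

print(created, raised)
```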


### **❗️ TRY IT OUT ❗️**
Run the code cell below to create a new directory called "`missing_reports`".

In [None]:
# Define the folder name
report_folder = "missing_reports"

# Create the folder if it does not exist
os.makedirs(report_folder, exist_ok=True)

### Step 2: Write Missing File Reports to CSVs  
The missing file data will be **saved to two CSV files** for documentation and further analysis.

Before we create our reports, let's understand the key components of writing CSV files in Python:

**Key concepts:**
- `csv.writer()` - Creates an object that can write rows to a CSV file
- `writer.writerow([data])` - Writes a single row (list of values) to the file
- `mode="w"` - Opens file in write mode (overwrites existing file)
- `newline=""` - Prevents extra blank lines in the CSV file
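
To see these pieces working together before writing the real reports, here is a minimal sketch that writes a two-row CSV to a temporary folder and reads it back. The file name `demo.csv` and the row values are illustrative assumptions.

```python
import csv
import os
import tempfile

with tempfile.TemporaryDirectory() as scratch:
    demo_path = os.path.join(scratch, "demo.csv")

    # Write a header row and one data row.
    with open(demo_path, mode="w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["File Type", "Count"])
        writer.writerow(["PDF", 2])

    # Read the file back to confirm what was written.
    with open(demo_path, newline="") as handle:
        rows = list(csv.reader(handle))

print(rows)
```

When the file is read back, every value is a string, so the count `2` comes back as `"2"`.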

**File structure we'll create:**
```
  missing_reports/
    missing_files_count.csv # Summary counts
    missing_files_list.csv # Detailed list

```

### Step 3: Create Summary Count
First, let's create a simple summary showing how many files are missing by type. This gives us a quick overview of the scope of missing files.


### **❗️ TRY IT OUT ❗️**
Run the code cell below to create a new CSV file containing the summary counts.

In [None]:
import csv

# Define the file path for the report
missing_count_csv_path = report_folder + "/" + "missing_files_count.csv"

# Write missing file counts
with open(missing_count_csv_path, mode="w", newline="") as csvfile:
    writer = csv.writer(csvfile)

    # Write header row
    writer.writerow(["File Type", "Count"])

    # Write total count rows
    writer.writerow(["PDF", num_missing_pdfs])
    writer.writerow(["CSV", num_missing_csvs])

print("Missing file count report saved to:", missing_count_csv_path)

Missing file count report saved to: missing_reports/missing_files_count.csv


### Step 4: Create Detailed List Report
Now let's create a detailed report that lists the actual filename of each missing file. This is more actionable - someone can use this list to create or locate the missing files.


### **❗️ TRY IT OUT ❗️**
Run the code cell below to create a new csv containing the report.

In [None]:
import csv  # already imported above; repeated so this cell also runs on its own

# Define the file path
missing_names_csv_path = report_folder + "/" + "missing_files_list.csv"

# Write missing file names
with open(missing_names_csv_path, mode="w", newline="") as csvfile:
    #define our writer, this will be used to write rows into our csv file
    writer = csv.writer(csvfile)

    # Write header
    writer.writerow(["File Type", "Missing File Name"])

    # Write missing PDFs
    for pdf in missing_pdfs:
        writer.writerow(["PDF", pdf])

    # Write missing CSVs
    for csv_file in missing_csvs:
        writer.writerow(["CSV", csv_file])

print("Missing file list saved to:", missing_names_csv_path)

Missing file list saved to: missing_reports/missing_files_list.csv




---

## **Deliverables**
You should complete all "Try it Out" sections of this lab. When you are done with this notebook, simply save your work, return to the Google Assignments page via the link in Assignments in bCourses and hit "Submit".

### Your code should:
   - Print:
     - Total number of `.docx`, `.pdf`, and `.csv` files.
     - Number of `.docx` files missing `.pdf` and/or `.csv` files.
   - Save missing file reports in the `missing_reports/` folder:
     - **`missing_files_count.csv`** – Number of missing files per type.
     - **`missing_files_list.csv`** – Names of missing `.pdf` and `.csv` files.

---


## **Hints**
- Use **sets** for efficient file comparison.
- Write modular functions for reusability.
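
As one possible reading of the first hint, the sketch below compares expected and actual file names using a set difference instead of a loop with `not in`. The tiny lists here are made up for illustration and stand in for the lab's `docx_report_numbers` and `pdf_files`.

```python
# Hypothetical stand-ins for the lab's lists.
docx_numbers = ["1", "2", "3"]
pdf_names = ["report_1.pdf", "report_3.pdf"]

# Build the set of PDF names we expect, one per report number.
expected_pdfs = {f"report_{n}.pdf" for n in docx_numbers}

# Set difference: names that are expected but not present.
# sorted() turns the result into a list with a predictable order.
missing = sorted(expected_pdfs - set(pdf_names))

print(missing)
print(len(missing))
```

Membership checks on a set do not scan every element the way they do on a list, which is why sets tend to be the faster choice as the number of files grows.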

Good luck!
