<a href="https://colab.research.google.com/github/mgr5222/CMPSC472_Proj1/blob/main/CMPSC472_Proj1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 1 - File processing system with multiprocessing and multithreading

The system should accept a directory containing multiple large text files. The goal is to count the frequency of a specific word (or set of words) in each file. Each file is processed in a separate process using fork(). Inside each process, multiple threads should be created to read and process portions of the file in parallel.

**If the program reports that it cannot open a file, upload the files from the Calgary Corpus by clicking "Upload to session storage" and selecting the appropriate files. A link to the files is included below.**


LINK: https://corpus.canterbury.ac.nz/descriptions/#calgary

Once the files have been uploaded, double-check the file paths in the code and change them if needed.

NOTE: There are two programs. The first attaches a single thread to each child process, so each file is read by one thread. The second uses 4 threads per child process, and each thread reads one part of the file.



---

## Project Thoughts


NOTE: The steps below were used to track my thought process while writing this code. They are NOT an extensive discussion of how the code was written.

1.   **Create child processes**

Each child process will be allocated a file to read from the Calgary Corpus. This will be done by creating an array of file names; each child process picks its file based on its index in the fork loop.


2.   **Create a thread with each child process**


Each child will open a thread, which will run the "process" function (work-in-progress function name).

3.  **Have each thread read their files**

Each thread will count the number of words in its given file, output that number, and send it back to the child process.

4. **Allocate memory to store the word count.**

The allocated memory will be used to communicate between the thread and the child process. It holds the word count, which will eventually be compared with the parent's results.

5. **Change Child threads for multithreading processes**

The program will now be copied into a separate block and restructured so that each child breaks the source file into parts, counts the words in each part, and finally returns the sum of the parts.

6. **Send information to the Parent**

Now that the child processes have completed their jobs, the information will be sent to the parent process for evaluation. For this project, the parent process will use a histogram to display the results.

7. **Adding functionality (not complete)**

Functionality still needs to be added to track the frequency of individual words and display the top 50 words.

Time tracking has been added to record how long each child process took. This has been added to both programs.
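
Step 5 has one subtle point: when a file is cut at arbitrary byte offsets, a word that crosses a cut can be counted by the threads on both sides of it. The sketch below is one possible way to avoid that. It assumes the text is already in a memory buffer instead of being read with `fgetc()`, and `is_separator` and `count_words_in_range` are hypothetical names that are not part of the programs in this notebook. Each range looks at the byte just before its start and only counts words that begin inside the range.

```c
#include <ctype.h>
#include <stddef.h>

/* Same separator rule as the programs in this notebook: whitespace or '-'. */
static int is_separator(unsigned char c) {
    return isspace(c) || c == '-';
}

/* Hypothetical helper: count the words that START inside buf[start, end). */
static int count_words_in_range(const char *buf, size_t start, size_t end) {
    int in_word = 0;
    int words = 0;

    /* If the previous byte is part of a word, this range begins mid-word.
     * That word belongs to the previous range, so do not count it again. */
    if (start > 0 && !is_separator((unsigned char)buf[start - 1])) {
        in_word = 1;
    }

    for (size_t k = start; k < end; k++) {
        if (is_separator((unsigned char)buf[k])) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;
            words++;
        }
    }
    return words;
}
```

Summing `count_words_in_range` over adjacent ranges then gives the same total as one pass over the whole buffer, no matter where the cuts fall.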


In [209]:
%%writefile proj1.c
#include <pthread.h>
#include <stdio.h>
#include <ctype.h>  // For isspace() and other character checking functions
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <semaphore.h>
#include <sys/wait.h>
#include <time.h>  // Include time.h for timing functionality


#define NUM_FILES 7

typedef struct{
  char *file;
  int *wordCount;
} ThreadData;


void *childProcess(void *arg){
  ThreadData *data = (ThreadData *)arg;
  char *file = data->file;
  int NUM_WORDS = 0;
  int inWord = 0; // Flag to track if we are in a word or not

  // Test to make sure the file is received by the thread
  // printf("received %s\n", file);

  // Open file
  FILE* fileprocess = fopen(file, "r");
  if (fileprocess == NULL){
    perror("FILE DID NOT OPEN");
  }
  else{

    // Read the file until end of file (EOF).
    // fgetc() returns an int so that EOF can be told apart from every valid byte.
    int reading;
    while ((reading = fgetc(fileprocess)) != EOF){
      if (isspace(reading) || reading == '-') {
        // Whitespace or a hyphen ends the current word
        inWord = 0;
      }else if (!inWord) {
        // We have entered a new word
        inWord = 1;
        NUM_WORDS++;
      }
    }
    fclose(fileprocess);
    //printf("Child found %d words in %s\n", NUM_WORDS, file);
  }
  *(data->wordCount) = NUM_WORDS;

  pthread_exit(NULL);
}


int main(){
  pid_t pid;
  pthread_t read;



  // File names that will be used in threads (CHANGE DIRECTORY LATER)
  char *filepaths[] = {
    "/content/bib",
    "/content/paper1",
    "/content/paper2",
    "/content/progc",
    "/content/progl",
    "/content/progp",
    "/content/trans"
  };


  // create child processes
  for(int i = 0; i < NUM_FILES; i++){

    pid = fork();

    if(pid == 0){
      // child process
      printf("Child process %d created: Process %d will open %s\n", i, i, filepaths[i]);
      // START CLOCK
      clock_t startTime = clock();

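      // Allocate memory for the thread's result
      int *wordCount = malloc(sizeof(int));
      if (wordCount == NULL){
        perror("malloc failed");
        exit(1);
      }
      *wordCount = 0;

      ThreadData data;
      data.file = filepaths[i];
      data.wordCount = wordCount;

      // create the thread within the child process
      if (pthread_create(&read, NULL, childProcess, (void *)&data) != 0){
        fprintf(stderr, "pthread_create failed\n");
        free(wordCount);
        exit(1);
      }

      // wait for the thread to finish
      pthread_join(read, NULL);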

      printf("Child process %d words counted: %d \n", i, *wordCount);
      // STOP CLOCK
      clock_t endTime = clock();
      double timeTaken = (double)(endTime - startTime) / CLOCKS_PER_SEC; // Calculate elapsed time
      printf("Child process %d completed in %.4f seconds.\n", i, timeTaken);
      free(wordCount);
      exit(0);
    }
    else if(pid > 0){
      // parent process
      printf("Parent process will wait for Child %d\n", i);
      // parent process will wait for child process

      wait(NULL);
    }else{
      perror("fork failed");
    }
  }

  return 0;

}

Overwriting proj1.c
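
In the program above, the parent calls `wait(NULL)` right after each `fork()`, so the children run one after another instead of at the same time. If the files should be processed concurrently, one option is to fork every child first and only wait afterwards. This is a minimal sketch under that assumption; `run_children` and `demo_work` are hypothetical names used only for illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the per-file work done by each child. */
static int demo_work(int index) {
    printf("child %d running as pid %d\n", index, (int)getpid());
    return 0;  /* 0 means success */
}

/* Hypothetical helper: fork n children, run work(i) in child i, and wait
 * only after all of them have been started.
 * Returns the number of children that exited with status 0. */
static int run_children(int n, int (*work)(int)) {
    int started = 0;
    for (int i = 0; i < n; i++) {
        fflush(stdout);  /* keep buffered parent output out of the child */
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork failed");
            break;
        }
        if (pid == 0) {
            int rc = work(i);
            fflush(stdout);
            _exit(rc == 0 ? 0 : 1);
        }
        started++;
    }

    int succeeded = 0;
    for (int i = 0; i < started; i++) {
        int status;
        if (wait(&status) > 0 && WIFEXITED(status) && WEXITSTATUS(status) == 0) {
            succeeded++;
        }
    }
    return succeeded;
}
```

A call such as `run_children(NUM_FILES, demo_work)` would start all seven children before the first `wait()`. The order of their output lines is then no longer guaranteed.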


In [210]:
%%shell
gcc proj1.c -o proj1 -pthread
./proj1

Parent process will wait for Child 0
Child process 0 created: Process 0 will open /content/bib
Child process 0 words counted: 20033 
Child process 0 completed in 0.0031 seconds.
Parent process will wait for Child 1
Child process 1 created: Process 1 will open /content/paper1
Child process 1 words counted: 8670 
Child process 1 completed in 0.0011 seconds.
Parent process will wait for Child 2
Child process 2 created: Process 2 will open /content/paper2
Child process 2 words counted: 13943 
Child process 2 completed in 0.0018 seconds.
Parent process will wait for Child 3
Child process 3 created: Process 3 will open /content/progc
Child process 3 words counted: 6362 
Child process 3 completed in 0.0015 seconds.
Parent process will wait for Child 4
Child process 4 created: Process 4 will open /content/progl
Child process 4 words counted: 11424 
Child process 4 completed in 0.0026 seconds.
Parent process will wait for Child 5
Child process 5 created: Process 5 will open /content/progp
Child



In [211]:
%%writefile proj1_multithreading.c
#include <pthread.h>
#include <stdio.h>
#include <ctype.h>  // For isspace() and other character checking functions
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <semaphore.h>
#include <sys/wait.h>
#include <time.h>  // Include time.h for timing functionality


#define NUM_FILES 7
#define NUM_THREADS 4 // 4 threads to equally divide the work

typedef struct {
  char *file;
  int wordCount;
  long start; // Start position of the file for each thread
  long end;   // End position of the file for each thread
} ThreadData;

void *process(void *arg) {
  ThreadData *data = (ThreadData *)arg;
  int NUM_WORDS = 0;
  int inWord = 0; // Flag to track if we are in a word or not

  // Open file
  FILE *fileprocess = fopen(data->file, "r");
  if (fileprocess == NULL) {
    perror("FILE DID NOT OPEN");
    return NULL;  // wordCount keeps the 0 set by the child process
  }

  // Move to starting point in the file
  fseek(fileprocess, data->start, SEEK_SET);

  // fgetc() returns an int so that EOF can be told apart from every valid byte
  int reading;
  // Read until the end of the assigned snippet or until EOF
  while (ftell(fileprocess) < data->end && (reading = fgetc(fileprocess)) != EOF) {
    if (isspace(reading) || reading == '-') {
      inWord = 0;  // No longer in a word
    } else if (!inWord) {
      inWord = 1;  // New word found
      NUM_WORDS++;
    }
  }
  // NOTE: a word that crosses a snippet boundary is counted by both threads,
  // so the total can be slightly higher than the single-thread count.

  fclose(fileprocess);
  data->wordCount = NUM_WORDS;  // Store the result in the thread data
  return NULL;
}

int main() {
  pid_t pid;
  pthread_t readf[NUM_THREADS];

  // Create pipes
  int pipefd[2];
  pipe(pipefd);

  // File names that will be used in threads (CHANGE DIRECTORIES IF NEEDED)
  char *filepaths[] = {
    "/content/bib",             // CHANGE DIRECTORIES IF NEEDED
    "/content/paper1",          // CHANGE DIRECTORIES IF NEEDED
    "/content/paper2",          // CHANGE DIRECTORIES IF NEEDED
    "/content/progc",           // CHANGE DIRECTORIES IF NEEDED
    "/content/progl",           // CHANGE DIRECTORIES IF NEEDED
    "/content/progp",           // CHANGE DIRECTORIES IF NEEDED
    "/content/trans"            // CHANGE DIRECTORIES IF NEEDED
  };

  // Parent will store all word counts for a histogram
  int wordCountResults[NUM_FILES] = {0};

  // Create child processes
  for (int i = 0; i < NUM_FILES; i++) {

    pid = fork();

    if (pid == 0) {
      // Child process
      printf("Child process %d created: Process %d will open %s\n", i, i, filepaths[i]);

      // START CLOCK
      clock_t startTime = clock();

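      // Open the file to get its size
      long fileSize = 0;
      FILE *file = fopen(filepaths[i], "r");
      if (file == NULL) {
        perror("FILE DID NOT OPEN");  // The threads will also fail and report 0 words
      } else {
        fseek(file, 0, SEEK_END);
        fileSize = ftell(file);  // Get the file size
        fclose(file);
      }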

      // Divide the file into equal parts
      long fileSnippet = fileSize / NUM_THREADS;

      ThreadData data[NUM_THREADS];

      // Create threads within child processes
      for (int j = 0; j < NUM_THREADS; j++) {
        data[j].file = filepaths[i];
        data[j].start = j * fileSnippet; // Start position for the reading
        // Conditional to decide the endpoint of the file snippet
        if (j == NUM_THREADS - 1) {
          data[j].end = fileSize; // For the last thread
        } else {
          data[j].end = (j + 1) * fileSnippet; // For other threads
        }
        data[j].wordCount = 0;

        pthread_create(&readf[j], NULL, process, (void *)&data[j]);

      }

      // Wait for all thread processes to finish and add results
      int totalWords = 0;
      for (int j = 0; j < NUM_THREADS; j++) {
        pthread_join(readf[j], NULL);
        totalWords += data[j].wordCount;
      }

      printf("Child process %d words counted: %d\n", i, totalWords);

      // STOP CLOCK
      clock_t endTime = clock();
      double timeTaken = (double)(endTime - startTime) / CLOCKS_PER_SEC; // Calculate elapsed time
      printf("Child process %d completed in %.4f seconds.\n", i, timeTaken);

      // Send word count to the parent via the pipe
      close(pipefd[0]);  // Close reading end in the child
      write(pipefd[1], &totalWords, sizeof(totalWords));
      close(pipefd[1]);  // Close writing end after sending the data

      exit(0);
    } else if (pid > 0) {
      // Parent process
      printf("Parent process will wait for Child %d\n", i);
      // Parent process waits for child process
      wait(NULL);  // Wait for child processes to finish
      read(pipefd[0], &wordCountResults[i], sizeof(int));  // Read word count from the pipe
      printf("Parent has read from Child %d\n", i);
    } else {
      perror("fork failed");
    }
  }

  close(pipefd[0]);  // Close reading end of the pipe

  // Display the histogram of word counts with asterisks
  printf("\nWord Count Histogram:\n");
  for (int i = 0; i < NUM_FILES; i++) {
    printf("File %d (%s): ", i + 1, filepaths[i]);
    // Print an asterisk for each 1000 words
    for (int j = 0; j < wordCountResults[i] / 1000; j++) {
      printf("*");
    }
    printf(" (%d words)\n", wordCountResults[i]);
  }

  return 0;
}


Overwriting proj1_multithreading.c
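
The parent above reads each result with a single `read()` and does not check its return value. Since the parent also keeps its own copy of the pipe's write end open, a `read()` would block if a child ever exited without writing. The sketch below shows one way the transfer itself could be checked; `send_count` and `recv_count` are hypothetical helper names, not functions used in the program above.

```c
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write one int to fd.
 * Returns 0 on success, -1 on error or short write. */
static int send_count(int fd, int count) {
    ssize_t n = write(fd, &count, sizeof count);
    return n == (ssize_t)sizeof count ? 0 : -1;
}

/* Hypothetical helper: read one int from fd into *out.
 * Returns 0 on success, -1 on error, end of file, or short read. */
static int recv_count(int fd, int *out) {
    ssize_t n = read(fd, out, sizeof *out);
    return n == (ssize_t)sizeof *out ? 0 : -1;
}
```

In the child this would replace the bare `write()` call, and in the parent the bare `read()` call. For end of file to be reported to the reader, every copy of the write end has to be closed first, including the parent's own `pipefd[1]`.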


In [212]:
%%shell
gcc proj1_multithreading.c -o proj1_multithreading -pthread
./proj1_multithreading

Parent process will wait for Child 0
Child process 0 created: Process 0 will open /content/bib
Child process 0 words counted: 20034
Child process 0 completed in 0.0083 seconds.
Parent has read from Child 0
Parent process will wait for Child 1
Child process 1 created: Process 1 will open /content/paper1
Child process 1 words counted: 8672
Child process 1 completed in 0.0048 seconds.
Parent has read from Child 1
Parent process will wait for Child 2
Child process 2 created: Process 2 will open /content/paper2
Child process 2 words counted: 13945
Child process 2 completed in 0.0065 seconds.
Parent has read from Child 2
Parent process will wait for Child 3
Child process 3 created: Process 3 will open /content/progc
Child process 3 words counted: 6364
Child process 3 completed in 0.0033 seconds.
Parent has read from Child 3
Parent process will wait for Child 4
Child process 4 created: Process 4 will open /content/progl
Child process 4 words counted: 11425
Child process 4 completed in 0.0058 

