# Lecture 6 : Characters, Strings, and Dynamic Memory Allocation

# Part 1 : Characters

## The C *char* type is one byte that is used to store characters such as the letters *a* and *b* and the punctuation symbol *!*.

## Working with the C *char* type and arrays of *char*, called *strings*, is how we process text in C.

## To see the characters and letters that certain values correspond to we use an [ASCII-TABLE](https://www.ascii-code.com/).

In [1]:
%%writefile char.c
#include <stdio.h>

int main () {
    char c = 'A';
    printf ("c as a number is %d\n",c);
    printf ("c as a character is %c\n",c);
}

Writing char.c


In [2]:
!gcc -o char char.c

In [3]:
!./char

c as a number is 65
c as a character is A


## Exercise : What is the range of ASCII values for the upper case letters?  Lower case letters?

# Part 2 : Strings

## A C string is a **null-terminated** array of characters.

## Strings can be initialized using the syntax:

    char str[] = "Hokies";

## The array *str* is actually of length 7 since it is null-terminated.

## Here is a short C program that illustrates the array *str*.

In [4]:
%%writefile hokies.c
#include <stdio.h>

int main () {
    char str[] = "Hokies";
    for (int i=0;i<7;i++) {
        printf ("%d\n",str[i]); // print the ASCII code of each character
    }
}

Writing hokies.c


In [5]:
!gcc -o hokies hokies.c

In [6]:
!./hokies

72
111
107
105
101
115
0


## Exercise : Look up each of the above ASCII codes to verify the characters in *Hokies*.

## Note that the ASCII value immediately following the six characters in *Hokies* is 0.  

## This 0 is the ASCII value for the *null character*.  

## The inclusion of this *null character* in C strings is why we say that C strings are *null-terminated*.

## The null character is critical for C strings because we use it to determine how long strings are.  

### Remember that in C arrays are not objects so we need a separate mechanism to keep track of the length.

## Note that the line of C code
    char str[] = "Hokies";
## is equivalent to

    char str[7] = { 'H', 'o', 'k', 'i', 'e', 's', '\0' };

## We prefer the first version because it is much easier to read!

## Also note that the null character can be specified using *'\0'*.  

## This is similar to how we specify the new line character using *'\n'*.

## There are various C functions for processing strings.  

## You can include the interfaces to these functions using:
    #include <string.h>

## One very useful string function is *strlen*, which returns the number of characters in a string.

## Note that the count returned by *strlen* does not include the null-terminator.

## Note that the function *strlen* returns a value of type *size_t*, which is a long unsigned int on typical 64-bit Linux systems.  

## We use the format specifier *%lu* to print a long unsigned int (the portable format specifier for a *size_t* is *%zu*).

## Also note that we print a C string using the format specifier *%s*.  

## The function *printf* prints the characters of a given string until it encounters the null-terminator.

In [7]:
%%writefile strlen.c
#include <stdio.h>
#include <string.h>

int main () {
    char str[] = "Hokies";
    printf ("The length of the string %s is %lu.\n",str,strlen(str));
}

Writing strlen.c


In [8]:
!gcc -o strlen strlen.c

In [9]:
!./strlen

The length of the string Hokies is 6.


# Part 3 : Strings and Pointers

## A pointer to a C string is a character pointer.
    char* str = "Hokies";

## Note that the above line of C code is very different from:
    char str[] = "Hokies";

## In the first line of C code, the pointer *str* points to a string that is stored in **constant memory**.  

## In the second line of C code, *str* is an array of characters with size 7 (6 for the letters in Hokies and 1 for the null-terminator) which will be **initialized** to contain the characters in the given string.

## Be careful ... constant memory is **read only**!

## Writing to constant memory is undefined behavior and will typically trigger a segmentation fault!

In [10]:
%%writefile danger.c
#include <stdio.h>

int main () {
    char* str = "Hokies";
    str[0] = 'h';
    printf ("The string in lower case is %s.\n",str);
}


Writing danger.c


In [11]:
!gcc -o danger danger.c

In [12]:
!./danger > out.txt

/bin/bash: line 1:   519 Segmentation fault      (core dumped) ./danger > out.txt


In [13]:
!cat out.txt

## Exercise : Fix the above code so that it works as expected.

## Here is an interesting example that uses an array of string pointers.  

In [14]:
%%writefile "mystery.c"
#include <stdio.h>

int main () {
    char* a[6] = { "Planet", "Hello", "Earth", "Go", "There", "Let's" };
    char* b[6] = { "Red", "Blue", "Hokies", "Green", "World", "Orange" };
    char* c[3] = { a[5], a[3], b[2] };
    char** d = c;
    printf ("%s %s %s!\n",d[0],d[1],d[2]);
}

Writing mystery.c


In [15]:
!gcc -o mystery mystery.c

In [16]:
!./mystery

Let's Go Hokies!


## Note that in line 7 we declare *d* to have type char**.  

## This literally means that *d* is a pointer to a pointer to a character.  

## Where have we seen a variable of type *char\*\** before?

# Part 4 : Command Line Arguments Revisited

## C command line arguments are strings!  

## The line of code
    int main (int argc, char** argv) {

## specifies the argv argument to have type pointer to a pointer to a character.

## More simply, *argv* is a pointer that points to an array of string pointers.  
  
## When we dereference *argv* using *argv[0]* we get a pointer to the first string in the array, which is the name used to run the program.

## Exercise: what does *argv[1]* point to?

## Here is an example C program that prints a command line argument in lower case.

## This example illustrates that we can overwrite C command line arguments.

## Exercise : What properties of the ASCII-TABLE are we taking advantage of in the code below?

In [17]:
%%writefile lower.c
#include <stdio.h>
#include <string.h>

int main (int argc, char** argv) {
    if (argc < 2) {
        printf ("command usage: %s str\n",argv[0]);
        return 1;
    }
    char* str = argv[1];
    for (int i=0;i<strlen(str);i++) {
        if ((str[i] >= 'A') && (str[i] <= 'Z')) {
            str[i] += 'a'-'A';
        }
    }
    printf ("The command line argument in lower case is %s.\n",str);
}

Writing lower.c


In [18]:
!gcc -o lower lower.c

In [19]:
!./lower HOKIES

The command line argument in lower case is hokies.


## Here is a program that checks that all command line arguments contain only lower case letters.  

## This example illustrates that C strings are passed to functions by pointer (just like other arrays in C).

In [20]:
%%writefile check.c
#include <stdio.h>
#include <string.h>
#include <stdbool.h>

bool is_lower(char* str) {
    for (int i=0;i<strlen(str);i++) {
        if ((str[i] < 'a') || (str[i] > 'z')) {
            return false;
        }
    }
    return true;
}

int main (int argc, char** argv) {
    for (int i=1;i<argc;i++) {
        if (!is_lower(argv[i])) {
            printf ("The command line argument %s is not all lower case letters.\n",argv[i]);
            return 1; // nonzero exit status signals that the check failed
        }
    }
    printf ("All command line arguments have only lower case letters.\n");
}

Writing check.c


In [21]:
!gcc -o check check.c

In [22]:
!./check Hello world

The command line argument Hello is not all lower case letters.


In [23]:
!./check this is great!

The command line argument great! is not all lower case letters.


In [24]:
!./check 2 + 2 = 4

The command line argument 2 is not all lower case letters.


In [25]:
!./check lets go

All command line arguments have only lower case letters.


# Part 5 : Working with a list of possible Wordle answers.

## Let's use Git to grab a file containing possible Wordle answers.

In [26]:
!git clone https://code.vt.edu/jasonwil/cmda3634_materials.git

Cloning into 'cmda3634_materials'...
remote: Enumerating objects: 163, done.
remote: Counting objects: 100% (126/126), done.
remote: Compressing objects: 100% (121/121), done.
remote: Total 163 (delta 43), reused 0 (delta 0), pack-reused 37 (from 1)
Receiving objects: 100% (163/163), 25.22 MiB | 38.04 MiB/s, done.
Resolving deltas: 100% (48/48), done.


In [27]:
!cp cmda3634_materials/L06/* .

## The number of words in the file:

In [28]:
!wc -l answers.txt

2309 answers.txt


## The first 10 answers:

In [29]:
!head -10 answers.txt

aback
abase
abate
abbey
abbot
abhor
abide
abled
abode
abort


## Here is a C program that searches the Wordle answer list for a given word.  

## Note that we have to be very careful when using *scanf* to read a string from a file.  

## Consider the following code snippet:

    char next[6];
    while (scanf("%5s",next) == 1)

## A Wordle word has 5 characters.

## Why do we set *next* to have size 6 instead of 5?

## By using the format specifier *%5s* we instruct *scanf* to read at most 5 characters at a time, which keeps it from going off the end of the *next* array.

## Note that *scanf* null-terminates the string it reads in.

## The string function *strcmp* returns 0 if the two string arguments are identical (i.e. they have the same length and the same characters).

In [30]:
%%writefile search.c
#include <stdio.h>
#include <string.h>

int main (int argc, char** argv) {
    if (argc < 2) {
        printf ("command usage: %s %s\n",argv[0],"word");
        return 1;
    }
    char* word = argv[1];
    char next[6]; // Need 5 chars for Wordle word and 1 for null terminator.
    while (scanf("%5s",next) == 1) { // %5s tells scanf to read at most 5 characters
        if (strcmp(word,next) == 0) { // strcmp returns 0 if the strings are equal
            printf ("%s is a possible Wordle answer.\n",word);
            return 0;
        }
    }
    printf ("%s is not a possible Wordle answer.\n",word);
}

Writing search.c


In [31]:
!gcc -o search search.c

In [32]:
!cat answers.txt | ./search hello

hello is a possible Wordle answer.


In [33]:
!cat answers.txt | ./search aargh

aargh is not a possible Wordle answer.


## Here is a C program that determines the most frequent letter in a given blank.  

## The command line argument blank is a number from 0 to 4 where 0 is the first blank, 1 is the second blank, etc.

## Exercise : Carefully explain what the following line of code is doing.

    count[next[blank]-'a'] += 1;

In [34]:
%%writefile frequent.c
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char** argv) {
    if (argc < 2) {
        printf ("command usage: %s %s\n",argv[0],"blank");
        return 1; // exit early so that we never read a missing argv[1]
    }
    int blank = atoi(argv[1]); // blank is a number from 0 to 4
    int count[26] = { 0 };
    char next[6];
    int total_words = 0;
    while (scanf("%5s",next) == 1) {
        count[next[blank]-'a'] += 1;
        total_words += 1;
    }
    int max_count = 0;
    char most_common = 'a'; // default so the variable is never used uninitialized
    for (int i=0;i<26;i++) {
        if (count[i] > max_count) {
            max_count = count[i];
            most_common = 'a'+i;
        }
    }
    printf ("The most frequently occurring letter in blank %d is %c.\n",
            blank,most_common);
    printf ("The letter %c occurs %d times in blank %d out of %d total words.\n",
            most_common,max_count,blank,total_words);
}

Writing frequent.c


In [35]:
!gcc -o frequent frequent.c

In [36]:
!cat answers.txt | ./frequent 0

The most frequently occurring letter in blank 0 is s.
The letter s occurs 365 times in blank 0 out of 2309 total words.


# Part 6 : Dynamic Memory Allocation

# Example : Sieve of Eratosthenes

## The Sieve of Eratosthenes is a method for finding all prime numbers up to a given number.  

## The wiki page https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes has a graphical demo showing how the algorithm works.  

## We use an integer array of size n+1 to keep track of which numbers are prime.  

## Here is our initial attempt at a C implementation.  

In [37]:
%%writefile count_primes_v1.c
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char** argv) {

    // count number of primes <= n
    // where n is read from the command line
    if (argc < 2) {
        printf ("command usage: %s %s\n",argv[0],"n");
        return 1;
    }
    int n = atoi(argv[1]);

    // initially assume all integers >= 2 are prime
    // note that we ignore is_prime[0] and is_prime[1]
    int is_prime[n+1];
    for (int i = 2; i <= n; i++) {
        is_prime[i] = 1;
    }

    // mark non-primes <= n using Sieve of Eratosthenes
    for (int d=2;d*d<=n;d++) {
        // if d is prime, then mark multiples of d as non-prime
        // suffices to consider multiples d*d, d*d+d, d*d+2d, ...
        if (is_prime[d]) {
            for (int c=d*d;c<=n;c+=d) {
                is_prime[c] = 0;
            }
        }
    }

    // count primes
    int primes = 0;
    for (int i = 2; i <= n; i++) {
        if (is_prime[i]) {
            primes++;
        }
    }
    printf ("The number of primes <= %d is %d",n,primes);
}

Writing count_primes_v1.c


In [38]:
!gcc -o count_primes_v1 count_primes_v1.c

In [39]:
!./count_primes_v1 10

The number of primes <= 10 is 4

## Let's try n equal to one million.

In [40]:
!time ./count_primes_v1 1000000

The number of primes <= 1000000 is 78498
real	0m0.020s
user	0m0.015s
sys	0m0.004s


## Let's try n equal to ten million.

In [41]:
!time ./count_primes_v1 10000000

/bin/bash: line 1:   598 Segmentation fault      (core dumped) ./count_primes_v1 10000000

real	0m0.046s
user	0m0.004s
sys	0m0.040s


## In version 1 we put the array on the stack which severely limits how big of an n we can handle.  

## In order to handle larger n, we will need to use **dynamic memory allocation**.  

## There are **two pools of memory** that our C programs can use.  

## The first pool of memory is called the **stack** which has a very limited size.  

## If you declare an array variable inside a function using the notation *int a[100]* then the 400 bytes of memory (an int is typically 4 bytes) for that array are allocated on the **stack**.  

## The second pool of memory is called the **heap** which has a much larger size than the **stack**.

## Memory on the heap is accessed very differently than memory on the stack.  

## To use heap memory, we need to ask the **memory manager** for a certain number of bytes.  If those bytes are available, then the **memory manager** will return to us a pointer to the beginning of our requested memory.

## We need to **free** up our requested memory when we no longer require it.

## We interface with the **memory manager** using C functions such as **malloc**, **calloc**, and **free**.

## Here is an example illustrating the basics of dynamic memory allocation on the heap.


In [42]:
%%writefile heap.c
#include <stdio.h>
#include <stdlib.h>

int main () {

    // allocate an array of 5 integers on the stack
    int a[5];

    // allocate an array of 5 integers on the heap
    // malloc stands for "memory allocation"
    // memory allocated using malloc is not initialized.
    int* b = (int*)malloc(5*sizeof(int));

    // allocate an array of 5 integers on the stack
    // and initialize the array to contain all 0s.
    int c[5] = { 0 };

    // allocate an array of 5 integers on the heap
    // and initialize the array to contain all 0s.
    // calloc stands for "clear allocate"
    // note that the interface to calloc differs slightly
    int* d = (int*)calloc(5,sizeof(int));

    // all of these arrays can be used in the same way
    a[2] = 10;
    b[2] = 2*a[2];
    c[2] += a[2]+b[2];
    d[2] += c[2]+a[2];
    printf ("%d %d %d %d\n",a[2],b[2],c[2],d[2]);

    // Memory for stack variables is automatically
    // freed when it is no longer needed.
    // Memory for heap variables is freed manually.
    // Do not continue to use heap memory after it has been freed!
    free(b);
    free(d);
}

Writing heap.c


In [43]:
!gcc -o heap heap.c

In [44]:
!./heap

10 20 30 40


## Let's revisit our program *bigarray.c* from lecture 5 but this time we use *malloc* to allocate the array.

In [45]:
%%writefile bigarray_v1.c
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char** argv) {
    if (argc < 2) {
        printf ("command usage: %s %s\n",argv[0],"size");
        return 1;
    }
    long len = atol(argv[1]);
    printf ("A uses %zu bytes of storage\n",len*sizeof(int)); // %zu prints a size_t
    int* A = (int*)malloc(len*sizeof(int));
    A[0] = 12345;
    A[len-1] = 54321;
    printf ("last element of A is %d\n",A[len-1]);
}

Writing bigarray_v1.c


In [46]:
!gcc -o bigarray_v1 bigarray_v1.c

## Let's try to create an array of one million integers.

In [47]:
!./bigarray_v1 1000000

A uses 4000000 bytes of storage
last element of A is 54321


## Let's try to create an array of one billion integers.

## This array has size 4 billion bytes which is 4 gigabytes.

In [48]:
!./bigarray_v1 1000000000

A uses 4000000000 bytes of storage
last element of A is 54321


## Let's try to create an array of 100 trillion integers.

## This array has size 400 trillion bytes or 400 terabytes.

In [49]:
!./bigarray_v1 100000000000000 > output

/bin/bash: line 1:   617 Segmentation fault      (core dumped) ./bigarray_v1 100000000000000 > output


## Unfortunately, we got a segmentation fault.  

## This example shows that it is **critical to check the return value of malloc** (and calloc) to ensure the allocation was successful.

## If malloc is unsuccessful it will return a NULL pointer.

## Dereferencing a NULL pointer is undefined behavior and typically results in a segmentation fault!

## Let's correct our program to check the return value for NULL.

In [50]:
%%writefile bigarray_v2.c
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char** argv) {
    if (argc < 2) {
        printf ("command usage: %s %s\n",argv[0],"size");
        return 1;
    }
    long len = atol(argv[1]);
    printf ("A uses %zu bytes of storage\n",len*sizeof(int)); // %zu prints a size_t
    int* A = (int*)malloc(len*sizeof(int));
    if (A == NULL) {
        printf ("malloc failed to allocate memory for A.\n");
        return 1;
    }
    A[0] = 12345;
    A[len-1] = 54321;
    printf ("last element of A is %d\n",A[len-1]);
}

Writing bigarray_v2.c


In [51]:
!gcc -o bigarray_v2 bigarray_v2.c

## Let's try to create an array of one trillion integers.

## This array has size 4 trillion bytes which is 4 terabytes.

In [52]:
!./bigarray_v2 1000000000000

A uses 4000000000000 bytes of storage
last element of A is 54321


## Let's try to create an array of 100 trillion integers.

## This array has size 400 trillion bytes or 400 terabytes.

In [53]:
!./bigarray_v2 100000000000000

A uses 400000000000000 bytes of storage
malloc failed to allocate memory for A.


## In this version we have a graceful exit when malloc fails.

## Here is a version of our sieve code that counts primes using an array on the heap.

## Note that we check the return value of malloc!

In [54]:
%%writefile count_primes_v2.c
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char** argv) {

    // count number of primes <= n
    // where n is read from the command line
    if (argc < 2) {
        printf ("command usage: %s %s\n",argv[0],"n");
        return 1;
    }
    int n = atoi(argv[1]);

    // initially assume all integers >= 2 are prime
    // note that we ignore is_prime[0] and is_prime[1]
    // We dynamically allocate the array using malloc.
    // It is critical to check the return value of
    // malloc to ensure the allocation was successful.
    // If malloc is unsuccessful it will return a NULL pointer.
    // Dereferencing a NULL pointer typically results in a segmentation fault!
    int* is_prime = (int*)malloc((n+1)*sizeof(int));
    if (is_prime == NULL) {
        printf ("malloc failed to allocate is_prime array!\n");
        return 1;
    }
    for (int i = 2; i <= n; i++) {
        is_prime[i] = 1;
    }

    // mark non-primes <= n using Sieve of Eratosthenes
    for (int d=2;d*d<=n;d++) {
        // if d is prime, then mark multiples of d as non-prime
        // suffices to consider multiples d*d, d*d+d, d*d+2d, ...
        if (is_prime[d]) {
            for (int c=d*d;c<=n;c+=d) {
                is_prime[c] = 0;
            }
        }
    }

    // count primes
    int primes = 0;
    for (int i = 2; i <= n; i++) {
        if (is_prime[i]) {
            primes++;
        }
    }
    printf ("The number of primes <= %d is %d",n,primes);

    // free the dynamically allocated array
    free(is_prime);
}


Writing count_primes_v2.c


## Let's switch on the optimizing compiler using -O3 and -march=native for better performance.

In [55]:
!gcc -O3 -march=native -o count_primes_v2 count_primes_v2.c

## Let's try n equal to one hundred million.

In [56]:
!time ./count_primes_v2 100000000

The number of primes <= 100000000 is 5761455
real	0m2.368s
user	0m2.072s
sys	0m0.242s


## Let's try n equal to one billion.

## In this case our **is_prime** array is roughly 4 billion bytes or 4 Gigabytes!

In [57]:
!time ./count_primes_v2 1000000000

The number of primes <= 1000000000 is 50847534
real	0m27.184s
user	0m24.423s
sys	0m2.376s


## By contrast, the stack size is 8192 kilobytes which is roughly 8 Megabytes!

In [58]:
!ulimit -s

8192
