# Lecture 7 : Characters and Strings

# Part 1 : Characters

## The C *char* type is one byte that is used to store characters such as the letters *a* and *b* and the punctuation symbol *!*.

## Working with the C char type and with arrays of chars, called *strings*, is how we process text in C.

## To see the characters and letters that certain values correspond to we use an [ASCII-TABLE](https://www.ascii-code.com/).

In [None]:
%%writefile char.c
#include <stdio.h>

int main () {
    char c = 'A';
    printf ("c as a number is %d\n",c);
    printf ("c as a character is %c\n",c);
}

Writing char.c


In [None]:
!gcc -o char char.c

In [None]:
!./char

c as a number is 65
c as a character is A


## Exercise : What is the range of ASCII values for the upper case letters?  Lower case letters?

# Part 2 : Strings

## A C string is a **null-terminated** array of characters.

## Strings can be initialized using the syntax:

    char str[] = "Hokies";

## The array *str* is actually of length 7 since it is null-terminated.

## Here is a short C program that illustrates the array *str*.

In [None]:
%%writefile hokies.c
#include <stdio.h>

int main () {
    char str[] = "Hokies";
    for (int i=0;i<7;i++) {
        printf ("%d\n",str[i]); // print the ASCII code of each character
    }
}

Writing hokies.c


In [None]:
!gcc -o hokies hokies.c

In [None]:
!./hokies

72
111
107
105
101
115
0


## Exercise : Look up each of the above ASCII codes to verify the characters in *Hokies*.

## Note that the ASCII value immediately following the six characters in *Hokies* is 0.  

## This 0 is the ASCII value for the *null character*.  

## The inclusion of this *null character* in C strings is why we say that C strings are *null-terminated*.

## The null character is critical for C strings because we use it to determine how long strings are.  

### Remember that a C array does not store its own length, so we need a separate mechanism to keep track of it.

## Note that the line of C code
    char str[] = "Hokies";
## is equivalent to

    char str[7] = { 'H', 'o', 'k', 'i', 'e', 's', '\0' };

## We prefer the first version because it is much easier to read!

## Also note that the null character can be specified using *'\0'*.  

## This is similar to how we specify the new line character using *'\n'*.

## There are various C functions for processing strings.  

## You can include the interfaces to these functions using:
    #include <string.h>

## One very useful string function is *strlen*, which returns the number of characters in a string.

## Note that the count returned by *strlen* does not include the null-terminator.

## Note that the function *strlen* returns a value of type *size_t*, which is a long unsigned int on typical 64-bit Linux systems.

## We use the format specifier *%lu* to print a long unsigned int (the portable specifier for *size_t* is *%zu*).

## Also note that we print a C string using the format specifier *%s*.  

## The function *printf* prints the characters of a given string until it encounters the null-terminator.

In [None]:
%%writefile strlen.c
#include <stdio.h>
#include <string.h>

int main () {
    char str[] = "Hokies";
    printf ("The length of the string %s is %lu.\n",str,strlen(str));
}

Writing strlen.c


In [None]:
!gcc -o strlen strlen.c

In [None]:
!./strlen

The length of the string Hokies is 6.


# Part 3 : Strings and Pointers

## A pointer to a C string is a character pointer.
    char* str = "Hokies";

## Note that the above line of C code is very different from:
    char str[] = "Hokies";

## In the first line of C code, the pointer *str* points to a string that is stored in **constant memory**.  

## In the second line of C code, *str* is an array of characters with size 7 (6 for the letters in Hokies and 1 for the null-terminator) which will be **initialized** to contain the characters in the given string.

## Be careful ... constant memory is **read only**!

## Writing to constant memory is undefined behavior and typically triggers a segmentation fault!

In [None]:
%%writefile danger.c
#include <stdio.h>

int main () {
    char* str = "Hokies";
    str[0] = 'h';
    printf ("The string in lower case is %s.\n",str);
}


Writing danger.c


In [None]:
!gcc -o danger danger.c

In [None]:
!./danger > out.txt

/bin/bash: line 1:   274 Segmentation fault      (core dumped) ./danger > out.txt


In [None]:
!cat out.txt

## Exercise : Fix the above code so that it works as expected.

## Here is an interesting example that uses an array of string pointers.  

## Exercise : Predict what the code prints before running it!

In [None]:
%%writefile "mystery.c"
#include <stdio.h>

int main () {
    char* a[6] = { "Planet", "Hello", "Earth", "Go", "There", "Let's" };
    char* b[6] = { "Red", "Blue", "Hokies", "Green", "World", "Orange" };
    char* c[3] = { a[5], a[3], b[2] };
    char** d = c;
    printf ("%s %s %s!\n",d[0],d[1],d[2]);
}

Writing mystery.c


In [None]:
!gcc -o mystery mystery.c

## Uncomment the following line when you are ready to check your answer!

In [None]:
#!./mystery

## Note that in line 7 we declare *d* to have type char**.  

## This literally means that *d* is a pointer to a pointer to a character.  

## Where have we seen a variable of type char** before?

# Part 4 : Command Line Arguments Revisited

## C command line arguments are strings!  

## The line of code
    int main (int argc, char** argv) {

## specifies the argv argument to have type pointer to a pointer to a character.

## More simply, *argv* is a pointer that points to an array of string pointers.  
  
## When we index *argv* using *argv[0]* we get a pointer to the first command line string, which is the name used to run the program.

## Exercise: what does *argv[1]* point to?

## Here is an example C program that prints a command line argument in lower case.

## This example illustrates that we can overwrite C command line arguments.

## Exercise : What properties of the ASCII-TABLE are we taking advantage of in the code below?

In [None]:
%%writefile lower.c
#include <stdio.h>
#include <string.h>

int main (int argc, char** argv) {
    if (argc < 2) {
        printf ("command usage: %s str\n",argv[0]);
        return 1;
    }
    char* str = argv[1];
    for (int i=0;i<strlen(str);i++) {
        if ((str[i] >= 'A') && (str[i] <= 'Z')) {
            str[i] += 'a'-'A';
        }
    }
    printf ("The command line argument in lower case is %s.\n",str);
}

Writing lower.c


In [None]:
!gcc -o lower lower.c

In [None]:
!./lower HOKIES

The command line argument in lower case is hokies.


## Here is a program that checks whether all command line arguments contain only lower case letters.  

## This example illustrates that C strings are passed to functions by pointer (just like other arrays in C).

In [None]:
%%writefile check.c
#include <stdio.h>
#include <string.h>
#include <stdbool.h>

bool is_lower(char* str) {
    for (int i=0;i<strlen(str);i++) {
        if ((str[i] < 'a') || (str[i] > 'z')) {
            return false;
        }
    }
    return true;
}

int main (int argc, char** argv) {
    for (int i=1;i<argc;i++) {
        if (!is_lower(argv[i])) {
            printf ("The command line argument %s is not all lower case letters.\n",argv[i]);
            return 1; // nonzero exit status signals that the check failed
        }
    }
    printf ("All command line arguments have only lower case letters.\n");
}

Writing check.c


In [None]:
!gcc -o check check.c

In [None]:
!./check Hello world

The command line argument Hello is not all lower case letters.


In [None]:
!./check this is great!

The command line argument great! is not all lower case letters.


In [None]:
!./check 2 + 2 = 4

The command line argument 2 is not all lower case letters.


In [None]:
!./check lets go

All command line arguments have only lower case letters.


# Part 5 : Working with a list of possible Wordle answers.

## Let's use Git to grab a file containing possible Wordle answers.

In [None]:
!git clone https://code.vt.edu/jasonwil/cmda3634_materials.git

Cloning into 'cmda3634_materials'...
remote: Enumerating objects: 55, done.
remote: Counting objects: 100% (8/8), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 55 (delta 1), reused 0 (delta 0), pack-reused 47
Receiving objects: 100% (55/55), 567.62 KiB | 3.66 MiB/s, done.
Resolving deltas: 100% (9/9), done.


In [None]:
!cp cmda3634_materials/PRO02/* .

## The number of words in the file:

In [None]:
!wc -l answers.txt

2309 answers.txt


## The first 10 answers:

In [None]:
!head -10 answers.txt

aback
abase
abate
abbey
abbot
abhor
abide
abled
abode
abort


## Here is a C program that searches the Wordle answer list for a given word.  

## Note that we have to be very careful when using *scanf* to read a string from standard input.  

## Consider the following code snippet:

    char next[6];
    while (scanf("%5s",next) == 1)

## A Wordle word has 5 characters.

## Why do we set *next* to have size 6 instead of 5?

## By using the format specifier *%5s* we instruct *scanf* to read at most 5 characters per string, so that it cannot go off the end of the *next* array.

## Note that *scanf* null-terminates the string it reads in.

## The string function *strcmp* returns 0 if the two string arguments are identical (i.e. they have the same length and the same characters).

In [None]:
%%writefile search.c
#include <stdio.h>
#include <string.h>

int main (int argc, char** argv) {
    if (argc < 2) {
        printf ("command usage: %s %s\n",argv[0],"word");
        return 1;
    }
    char* word = argv[1];
    char next[6]; // Need 5 chars for Wordle word and 1 for null terminator.
    while (scanf("%5s",next) == 1) { // %5s tells scanf to read at most 5 characters
        if (strcmp(word,next) == 0) { // strcmp returns 0 if the strings are equal
            printf ("%s is a possible Wordle answer.\n",word);
            return 0;
        }
    }
    printf ("%s is not a possible Wordle answer.\n",word);
}

Writing search.c


In [None]:
!gcc -o search search.c

In [None]:
!cat answers.txt | ./search hello

hello is a possible Wordle answer.


In [None]:
!cat answers.txt | ./search aargh

aargh is not a possible Wordle answer.


## Here is a C program that determines the most frequent letter in a given blank.  

## The command line argument blank is a number from 0 to 4 where 0 is the first blank, 1 is the second blank, etc.

## Exercise : Carefully explain what the following line of code is doing.

    count[next[blank]-'a'] += 1;

In [None]:
%%writefile frequent.c
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char** argv) {
    if (argc < 2) {
        printf ("command usage: %s %s\n",argv[0],"blank");
        return 1; // stop here: argv[1] is not available
    }
    int blank = atoi(argv[1]); // blank is a number from 0 to 4
    int count[26] = { 0 };
    char next[6];
    int total_words = 0;
    while (scanf("%5s",next) == 1) {
        count[next[blank]-'a'] += 1;
        total_words += 1;
    }
    int max_count = 0;
    char most_common = '?'; // placeholder in case no words are read
    for (int i=0;i<26;i++) {
        if (count[i] > max_count) {
            max_count = count[i];
            most_common = 'a'+i;
        }
    }
    printf ("The most frequently occurring letter in blank %d is %c.\n",
            blank,most_common);
    printf ("The letter %c occurs %d times in blank %d out of %d total words.\n",
            most_common,max_count,blank,total_words);
}

Writing frequent.c


In [None]:
!gcc -o frequent frequent.c

In [None]:
!cat answers.txt | ./frequent 0

The most frequently occurring letter in blank 0 is s.
The letter s occurs 365 times in blank 0 out of 2309 total words.
