# Lecture 6 : Real Numbers and More Arrays

# Part 1 : Working with Real Numbers in C

## Here is a C program that computes the average of numbers in *stdin*.

In [1]:
%%writefile average_v1.c
#include <stdio.h>

int main () {
    int next;
    int sum = 0;
    int count = 0;
    while (scanf("%d",&next) == 1) {
        sum += next;
        count += 1;
    }
    printf ("average = %d\n",sum/count);
}

Overwriting average_v1.c


In [2]:
!gcc -o average_v1 average_v1.c

In [3]:
!echo 1 2 3 4 5 | ./average_v1

average = 3


## Note that the average of the numbers $1$, $2$, $3$, $4$, $5$ is indeed $3$.

In [4]:
!echo 1 2 3 4 | ./average_v1

average = 2


## In this case the average should be $10/4 = 2.5$, but the program prints $2$ because dividing one *int* by another is *integer division*, which discards the fractional part.

## Since the average of a set of numbers is typically a real number, we need to use a floating point type such as a float or double to store the average.

In [5]:
%%writefile average_v2.c
#include <stdio.h>

int main () {
    int next;
    int sum = 0;
    int count = 0;
    while (scanf("%d",&next) == 1) {
        sum += next;
        count += 1;
    }
    double average = sum/count;
    printf ("average = %f\n",average);
}

Overwriting average_v2.c


## Note that on line 13 we use *%f* as the format specifier for printing a float or a double.

In [6]:
!gcc -o average_v2 average_v2.c

In [7]:
!echo 1 2 3 4 | ./average_v2

average = 2.000000


## We see that this version still computes the incorrect average.  

## Note that in line 12 we are computing the quantity sum/count and assigning it to average.  

## However, since both variables sum and count have *int* type, the division is integer division.  

## The next version corrects this problem.  



In [8]:
%%writefile average_v3.c
#include <stdio.h>

int main () {
    int next;
    int sum = 0;
    int count = 0;
    while (scanf("%d",&next) == 1) {
        sum += next;
        count += 1;
    }
    double average = (double)sum/count;
    printf ("average = %.2f\n",average);
}

Overwriting average_v3.c


## Note that this time on line 13 we use the *%.2f* format specifier.  The *.2* tells printf to round the result to 2 decimal places when printing.  

In [9]:
!gcc -o average_v3 average_v3.c

In [10]:
!echo 1 2 3 4 | ./average_v3

average = 2.50


## In this case we are getting the correct answer 2.5.  

## In line 12, we compute (double)sum/count.  

## Although both sum and count are *int*, we cast sum to *double* before the division takes place.  

## Because one operand is now a *double*, C converts the integer count to *double* as well, and the division is floating point instead of integer division.

In [11]:
!echo 1.5 2.5 | ./average_v3

average = 1.00


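## In this case we get an incorrect answer since the inputs are real numbers instead of integers.  The *%d* specifier reads the $1$ and stops at the decimal point; the next call to *scanf* cannot convert *.5* to an integer, so the loop ends with sum $= 1$ and count $= 1$.  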

## To fix this we need to change line 8 to read double types using the *%lf* format specifier.  

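## **Be careful** not to use the format *%f* to read a double type using *scanf*.  With *scanf*, *%f* expects a pointer to a float and *%lf* expects a pointer to a double (unlike *printf*, where *%f* prints both).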

## We also need to change types of next and sum to *double*.  

In [12]:
%%writefile average_v4.c
#include <stdio.h>

int main () {
    double next;
    double sum = 0.;
    int count = 0;
    while (scanf("%lf",&next) == 1) {
        sum += next;
        count += 1;
    }
    double average = sum/count;
    printf ("average = %.2f\n",average);
}

Overwriting average_v4.c


In [13]:
!gcc -o average_v4 average_v4.c

In [14]:
!echo 1.5 2.5 | ./average_v4

average = 2.00


In [15]:
!echo 1.5 2.5 3.5 | ./average_v4

average = 2.50


# Part 2 : Float or Double?

## On most platforms a C float has 32 bits and a C double has 64 bits (the same sizes as in Java).

## To see the difference in accuracy between a float and a double, let's approximate the value of:

$$\large{e = 2.718281828459045}$$

## We can approximate e using the Taylor series formula:

$$\large{e \approx 1 + \frac{1}{1!} + \frac{1}{2!} + \frac{1}{3!} + \cdots}$$

## Version 1 adds up the first n terms of the above formula and accumulates the result in a *float*.  

In [16]:
%%writefile approx_e_v1.c
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char* argv[]) {
    if (argc < 2) {
        printf ("command usage: %s %s\n",argv[0],"n");
        return 1;
    }
    int n = atoi(argv[1]);
    float approx_e = 0.;
    long long fact = 1;
    for (int i=1;i<=n;i++) {
        approx_e += 1.0/fact;
        fact *= i;
    }
    printf ("exact  value of e is %.15f\n",2.718281828459045);
    printf ("approx value of e is %.15f\n",approx_e);
}

Overwriting approx_e_v1.c


In [17]:
!gcc -o approx_e_v1 approx_e_v1.c

In [18]:
!./approx_e_v1 10

exact  value of e is 2.718281828459045
approx value of e is 2.718281745910645


## With 10 terms we estimated the value of $e$ correctly to 6 decimal digits.  

In [19]:
!./approx_e_v1 20

exact  value of e is 2.718281828459045
approx value of e is 2.718281984329224


## With 20 terms we still only estimated the value of $e$ correctly to 6 decimal digits.  

## Version 2 adds up the first n terms of the above formula and accumulates the result in a *double*.

In [20]:
%%writefile approx_e_v2.c
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char* argv[]) {
    if (argc < 2) {
        printf ("command usage: %s %s\n",argv[0],"n");
        return 1;
    }
    int n = atoi(argv[1]);
    double approx_e = 0.;
    long long fact = 1;
    for (int i=1;i<=n;i++) {
        approx_e += 1.0/fact;
        fact *= i;
    }
    printf ("exact  value of e is %.15f\n",2.718281828459045);
    printf ("approx value of e is %.15f\n",approx_e);
}

Overwriting approx_e_v2.c


In [21]:
!gcc -o approx_e_v2 approx_e_v2.c

In [22]:
!./approx_e_v2 10

exact  value of e is 2.718281828459045
approx value of e is 2.718281525573192


## With 10 terms we estimated the value of $e$ correctly to 6 decimal digits.

In [23]:
!./approx_e_v2 20

exact  value of e is 2.718281828459045
approx value of e is 2.718281828459046


## With 20 terms we estimated the value of $e$ correctly to 14 decimal digits.

In [24]:
!./approx_e_v2 30

exact  value of e is 2.718281828459045
approx value of e is 2.718281828459046


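## With 30 terms we still only estimated the value of $e$ correctly to 14 decimal digits.  A double holds only about 15 to 16 significant decimal digits, so the later terms are too small to change the sum.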

## The decision to use float versus double is a tradeoff between accuracy and storage/performance.

## In this class we will normally use double for the additional accuracy.

## In certain machine learning applications, high accuracy is not needed so it is common to use *float* instead of *double* (or even floating point types that use fewer than 32 bits)!  This is especially true for large scale ML algorithms implemented using GPUs where high performance is essential.

# Part 3 : Sample Standard Deviation

## Here is a C program that computes the average and sample standard deviation of real number scores in *stdin*.  For the standard deviation we use the formula:

$$\sigma = \sqrt{ \frac{1}{N-1} \sum_{i=1}^N (x_i - \bar{x})^2}$$

## where

$$\bar{x} = \displaystyle\frac{1}{N} \displaystyle\sum_{i=1}^N x_i$$

## In this case we will have to store the data in an array since we need to make two passes over it.

In [25]:
%%writefile stdev.c
#include <stdio.h>
#include <math.h>

#define MAX_SCORES 1000

int main () {
    double scores[MAX_SCORES];
    double next;
    int num_scores = 0;
    while (scanf("%lf",&next) == 1) {
        if (num_scores < MAX_SCORES) {
            scores[num_scores] = next;
            num_scores += 1;
        } else {
            printf ("Too many scores!\n");
            return 1;
        }
    }
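    // the sample standard deviation divides by N-1, so we need N >= 2
    if (num_scores < 2) {
        printf ("Need at least two scores!\n");
        return 1;
    }
    double sum = 0.;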
    for (int i=0;i<num_scores;i++) {
        sum += scores[i];
    }
    double mean = sum/num_scores;
    printf ("mean = %.2f\n",mean);
    double sum_sqs = 0.;
    for (int i=0;i<num_scores;i++) {
        sum_sqs += (scores[i]-mean)*(scores[i]-mean);
    }
    double var = sum_sqs/(num_scores-1);
    printf ("standard deviation = %.2f\n",sqrt(var));
}

Overwriting stdev.c


In [26]:
!gcc -o stdev stdev.c -lm

In [27]:
!echo 86.5 81.0 92.5 86.5 74.5 57.5 76.5 94.5 66.5 98.5 23.5 47.5 74.5 77.5 88.0 | ./stdev

mean = 75.03
standard deviation = 19.83


# Part 4 : Finding the Best Wordle Start Word

In [28]:
!wget -O answers.txt https://gist.githubusercontent.com/cfreshman/a7b776506c73284511034e63af1017ee/raw/60531ab531c4db602dacaa4f6c0ebf2590b123da/wordle-nyt-answers-alphabetical.txt

--2024-02-05 18:33:30--  https://gist.githubusercontent.com/cfreshman/a7b776506c73284511034e63af1017ee/raw/60531ab531c4db602dacaa4f6c0ebf2590b123da/wordle-nyt-answers-alphabetical.txt
Resolving gist.githubusercontent.com (gist.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.111.133, ...
Connecting to gist.githubusercontent.com (gist.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13853 (14K) [text/plain]
Saving to: ‘answers.txt’


2024-02-05 18:33:30 (29.1 MB/s) - ‘answers.txt’ saved [13853/13853]



In [29]:
!wc -l answers.txt

2308 answers.txt


In [30]:
%%writefile start_v1.c
#include <stdio.h>
#include <string.h>

int main () {
    // calculate the frequency of each letter
    int count[5][26] = { { 0 }, { 0 }, { 0 }, { 0 }, { 0 } };
    char next_word[6];
    while (scanf("%5s",next_word) == 1) {
        for (int b=0;b<5;b++) {
            count[b][next_word[b]-'a'] += 1;
        }
    }
    // create a string of most frequent letters
    char start[6];
    start[5] = '\0';
    for (int b=0;b<5;b++) {
        int max_count = 0;
        for (int l=0;l<26;l++) {
            if (count[b][l] > max_count) {
                max_count = count[b][l];
                start[b] = 'a'+l;
            }
        }
    }
    printf ("best start string is %s\n",start);
}

Overwriting start_v1.c


In [31]:
!gcc -o start_v1 start_v1.c

In [32]:
!cat answers.txt | ./start_v1

best start string is saaee


In [33]:
%%writefile start_v2.c
#include <stdio.h>
#include <string.h>

#ifndef MAX_WORDS
#define MAX_WORDS 3000
#endif

int main () {

    // read in the list of Wordle answers
    char words[MAX_WORDS][6];
    char next_word[6];
    int total_words = 0;
    while (scanf("%5s",next_word) == 1) {
        if (total_words < MAX_WORDS) {
            strcpy(words[total_words],next_word);
            total_words += 1;
        } else {
            printf ("Too many words!\n");
            return 1;
        }
    }

    // calculate frequency of each letter
    int count[5][26] = { { 0 }, { 0 }, { 0 }, { 0 }, { 0 } };
    for (int w=0;w<total_words;w++) {
        for (int b=0;b<5;b++) {
            count[b][words[w][b]-'a'] += 1;
        }
    }

    // find the Wordle answer with the max score
    char* start;
    int max_score = 0;
    for (int w=0;w<total_words;w++) {
        int score = 0;
        for (int b=0;b<5;b++) {
            score += count[b][words[w][b]-'a'];
        }
        if (score > max_score) {
            max_score = score;
            start = words[w];
        }
    }

    printf ("best start word is %s with a score of %d\n",start,max_score);
}

Overwriting start_v2.c


In [34]:
!gcc -o start_v2 start_v2.c

In [35]:
!echo start zebra timer squad | ./start_v2

best start word is start with a score of 7


In [36]:
!cat answers.txt | ./start_v2

best start word is slate with a score of 1432


In [37]:
!wget -O words.txt https://gist.githubusercontent.com/cfreshman/d97dbe7004522f7bc52ed2a6e22e2c04/raw/633058e11743065ad2822e1d2e6505682a01a9e6/wordle-nyt-words-14855.txt

--2024-02-05 18:33:31--  https://gist.githubusercontent.com/cfreshman/d97dbe7004522f7bc52ed2a6e22e2c04/raw/633058e11743065ad2822e1d2e6505682a01a9e6/wordle-nyt-words-14855.txt
Resolving gist.githubusercontent.com (gist.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.111.133, ...
Connecting to gist.githubusercontent.com (gist.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 89129 (87K) [text/plain]
Saving to: ‘words.txt’


2024-02-05 18:33:31 (6.18 MB/s) - ‘words.txt’ saved [89129/89129]



In [38]:
!cat words.txt | ./start_v2

Too many words!


In [39]:
!wc -l words.txt

14854 words.txt


In [40]:
!gcc -DMAX_WORDS=15000 -o start_v2 start_v2.c

In [41]:
!cat words.txt | ./start_v2

best start word is sanes with a score of 12337
