<h1 style="text-align: center">
<div style="color: #DD3403; font-size: 60%">Data Science DISCOVERY MicroProject</div>
<span style="">MicroProject: Building a Scene Recognition Model from Video Frames</span>
<div style="font-size: 60%;"><a href="https://discovery.cs.illinois.edu/microproject/video-frame-scene-recognition-model/">https://discovery.cs.illinois.edu/microproject/video-frame-scene-recognition-model/</a></div>
</h1>

<hr style="color: #DD3403;">

## Data Source: Frames of a Video

Visual images are an important part of all media, and data scientists often use images as data sources.  In this MicroProject, you will create a simple model to detect the amount of time spent in two different "scenes" we used when creating office-hour style videos for Data Science DISCOVERY.  To do this, you will learn how to import an entire folder of images, perform image analysis, and create your own model without using a pre-built library.  Let's nerd out! :)

> *This MicroProject was inspired by a podcast that we recently recorded with the team from the Center for Innovation in Teaching and Learning who helped produce our video.  To learn the background and hear from Karle and Wade about the journey of creating DISCOVERY, go over and listen to our episode on the "Teach Talk Listen Learn Podcast" where we talk with TTLL host Bob Dignan and our CITL video producer Eric Schumacher: https://citl.illinois.edu/citl-101/teaching-learning/teach-talk-listen-learn*


## Loading Video Frames

We have provided you with one frame every second from our video [*"Outliers Impact on Correlation (m6-02b)"*](https://www.youtube.com/watch?v=bd6hQ2UcIJc) that is used as part of our [DISCOVERY lecture covering Correlation](https://discovery.cs.illinois.edu/learn/Towards-Machine-Learning/Correlation/).  All of these frames are in the `frames` sub-folder.

The `skimage` library is commonly used to load image data into Python.  Specifically:

- The full function name we will be using is `skimage.io.imread(filename)`.  This function will read a filename and return the pixel color for every pixel in the image.

- To use the `imread` function, you will need to do one of the following:

    1. Import the entire `skimage` library by using the import line: `import skimage`.  After importing all of `skimage`, you will call the function using its fully qualified name: `skimage.io.imread(filename)`.
    
    **ALTERNATIVELY**
    
    2. Import only the `imread` function by using the more specific import line: `from skimage.io import imread`.  After importing only `imread`, you will call the function directly: `imread(filename)`.

    *(People differ in how they prefer to import and use libraries.  Both techniques work! :))*

### Read Pixel Data for `frames/frame_0001.jpg`

As noted earlier, we have provided a `frames` directory with all of the frames.

In the following cell, store the pixel color data from the image file `frames/frame_0001.jpg` in the variable `pixels` by using the `imread` function:


In [7]:
import skimage
pixels = skimage.io.imread('frames/frame_0001.jpg')
pixels

array([[[ 91,  83,  80],
        [ 91,  83,  80],
        [ 91,  83,  80],
        ...,
        [ 75,  72,  79],
        [ 80,  78,  83],
        [ 83,  81,  86]],

       [[ 91,  83,  80],
        [ 91,  83,  80],
        [ 91,  83,  80],
        ...,
        [ 73,  70,  77],
        [ 79,  77,  82],
        [ 83,  81,  86]],

       [[ 91,  83,  80],
        [ 91,  83,  80],
        [ 91,  83,  80],
        ...,
        [ 69,  66,  73],
        [ 77,  75,  80],
        [ 83,  81,  86]],

       ...,

       [[174, 142, 121],
        [174, 142, 121],
        [174, 142, 121],
        ...,
        [163, 131, 110],
        [163, 131, 110],
        [163, 131, 110]],

       [[174, 142, 121],
        [174, 142, 121],
        [175, 143, 122],
        ...,
        [162, 131, 110],
        [162, 131, 110],
        [162, 131, 110]],

       [[173, 141, 120],
        [174, 142, 121],
        [175, 143, 122],
        ...,
        [162, 131, 110],
        [162, 131, 110],
        [162, 131, 110]]], dtype=uint8)

### 🔬 Checkpoint Tests 🔬

In [6]:
## == CHECKPOINT TESTS ==
# - This read-only cell contains a "checkpoint" for this section of the MicroProject and verifies you are on the right track.
# - If this cell results in a celebration message, you PASSED all test cases!
# - If this cell results in any errors, check your previous cells, make changes, and RE-RUN your code and then this cell.
tada = "\N{PARTY POPPER}"

assert("pixels" in vars())
assert(pixels.shape == (360, 640, 3))
assert(pixels[0][0][0] == 91)

print(f"{tada} All Tests Passed! {tada}")

🎉 All Tests Passed! 🎉


<hr style="color: #DD3403;">

## Part 1: Storing Average Pixel Color

The **shape** of your data is `rows` by `columns` by `color values`, stored as a 3-dimensional list.  Here's a formatted view of your `pixels` data:

```
[
  [ [91, 83, 80], [91, 83, 80], [91, 83, 80], ... ],     # Row #1
  [ [91, 83, 80], [91, 83, 80], [91, 83, 80], ... ],     # Row #2
  ...                                                    # ...
]
```

The current shape of `pixels` is 360 rows by 640 columns by 3 colors (`360` x `640` x `3`).  The three color values of each pixel represent the three color channels on a screen: red, green, and blue.
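
To make the three dimensions concrete, here is a minimal sketch that uses a synthetic all-zero array as a stand-in for a real frame.  The name `demo` is made up for this illustration; the real `pixels` array comes from `imread`.

```python
import numpy as np

# Synthetic stand-in for a frame: 360 rows x 640 columns x 3 color channels.
# (Assumption: this all-zero array only mimics the shape of the real image.)
demo = np.zeros((360, 640, 3), dtype=np.uint8)

# Give the top-left pixel the color seen in frames/frame_0001.jpg.
demo[0, 0] = [91, 83, 80]

rows, columns, channels = demo.shape
print(rows, columns, channels)   # 360 640 3
print(demo[0][0])                # [91 83 80] -> red, green, blue of one pixel
print(demo[0][0][0])             # 91 -> the red value of that pixel
print(rows * columns)            # 230400 pixels in total
```

The first index picks a row, the second picks a column, and the third picks a color channel.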

Using `pixels.mean()`, we find the average of **ALL** the color values lumped together (combining blues and reds and greens).  Try it out:


In [8]:
pixels.mean()

72.18011863425926

This value is not very useful.  It is the average of red, green, and blue all lumped together -- it would be far more useful to find the average **red**, average **green**, and average **blue** independently.

To do that, we first need to "flatten" the list so that we have a list of only color data instead of a list of rows, columns, and then color data.  That means we want our list to look like the following:

```
[
  [ 91, 83, 80 ],    # Pixel #1 color data
  [ 91, 83, 80 ],    # Pixel #2 color data
  [ 91, 83, 80 ],    # Pixel #3 color data
  ...
]
```

### Using `pixels.reshape()`

Now that we know the desired shape of the list, the `reshape` function can do the hard work!  We know we want the final shape to be `?` x `3`.  As long as you have only one unknown dimension, NumPy allows you to provide a `-1` for it and will work out its size from the amount of data.

That means `pixels.reshape(-1, 3)` will reshape our list to be a single long list of color data.  Let's try out that transformation:

In [9]:
pixels = pixels.reshape(-1, 3)
pixels

array([[ 91,  83,  80],
       [ 91,  83,  80],
       [ 91,  83,  80],
       ...,
       [162, 131, 110],
       [162, 131, 110],
       [162, 131, 110]], dtype=uint8)
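
The same transformation is easier to follow on a tiny example.  The sketch below uses a made-up 2 x 2 "image" (the name `tiny` and its values are invented for illustration) to show how `-1` is filled in:

```python
import numpy as np

# Tiny made-up "image": 2 rows, 2 columns, 3 color values per pixel.
tiny = np.array([
    [[10, 20, 30], [40, 50, 60]],      # Row #1
    [[70, 80, 90], [100, 110, 120]],   # Row #2
], dtype=np.uint8)

# NumPy infers the -1 as 2 * 2 = 4, because 12 values / 3 per pixel = 4 pixels.
flat = tiny.reshape(-1, 3)

print(tiny.shape)   # (2, 2, 3)
print(flat.shape)   # (4, 3)
print(flat)         # one [r, g, b] row per pixel, in row-by-row order
```

For the real frame, the same call turns `360` x `640` x `3` into `230400` x `3`, since 360 * 640 = 230400.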

Finally, we want the average of each color channel (each column of the list).  To do this, `pixels.mean(axis=0)` averages down the rows of our newly formatted list of pixels, giving one average per column: red, green, and blue:

In [10]:
pixels.mean(axis=0)

array([88.65917535, 67.45620226, 60.4249783 ])
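
To see the difference between `mean()` and `mean(axis=0)`, here is a small sketch on three made-up pixels (the array `flat_demo` is invented for illustration and is not part of the frame data):

```python
import numpy as np

# Three made-up pixels, already in the flattened "one [r, g, b] per row" shape.
flat_demo = np.array([
    [10, 20, 30],
    [30, 40, 50],
    [50, 60, 70],
])

overall = flat_demo.mean()             # all nine values lumped together
per_channel = flat_demo.mean(axis=0)   # one average per column

print(overall)       # 40.0
print(per_channel)   # [30. 40. 50.]
```

With `axis=0`, the rows are collapsed and each column keeps its own average, so the result has one entry per color channel.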

### Puzzle 1.1: Finding the Average Color of One Image

Given the output you saw above, write the Python code to store the average red value of `pixels` in `r`, the average green value in `g`, and the average blue value in `b`:

In [11]:
r = pixels.mean(axis=0)[0]
g = pixels.mean(axis=0)[1]
b = pixels.mean(axis=0)[2]

In [12]:
## == CHECKPOINT TESTS ==
# - This read-only cell contains a "checkpoint" for this section of the MicroProject and verifies you are on the right track.
# - If this cell results in a celebration message, you PASSED all test cases!
# - If this cell results in any errors, check your previous cells, make changes, and RE-RUN your code and then this cell.
tada = "\N{PARTY POPPER}"

import math
assert("r" in vars())
assert("g" in vars())
assert("b" in vars())
assert(r > 88 and r < 89)
assert(g > 67 and g < 68)
assert(b > 60 and b < 61)

print(f"{tada} All Tests Passed! {tada}")

🎉 All Tests Passed! 🎉


### Puzzle 1.2: Finding the Average Color of All Images

The following code loops through every file in the `frames` directory -- this will include `frame_0001.jpg` (like you analyzed already) and also `frame_0002.jpg`, `frame_0003.jpg`, and all 300+ frames!

Create a DataFrame where each row is one frame with the following four columns:
- `frame`, the filename of the frame
- `r`, the average red color of the frame
- `g`, the average green color of the frame
- `b`, the average blue color of the frame

The structure of the code should be nearly **identical to writing a simulation**.  For "Step 3", where you would normally simulate a random variable for the real-world event, you should instead use the real-world data.  This real-world data will be the filename `frame`, and the `r`, `g`, and `b` values should be the average color of that frame.

- See: https://discovery.cs.illinois.edu/learn/Simulation-and-Distributions/Simple-Simulations-in-Python/

In [15]:
import glob
import os
import pandas as pd

data = []
for frame in sorted(glob.glob(os.path.join("frames", "*.jpg"))):  # sorted, since glob does not guarantee an order
  # `frame` contains the filename of the frame (ex: "frames/frame_0001.jpg").  Use it for `imread` to read the frame image data.
  pixels = skimage.io.imread(frame).reshape(-1, 3)
  r, g, b = pixels.mean(axis=0)  # One average per color channel: red, green, blue
  d = {'frame': frame, 'r': r, 'g': g, 'b': b}
  data.append(d)

df = pd.DataFrame(data)
df

|     | frame | r | g | b |
| --- | --- | --- | --- | --- |
| 0 | frames\frame_0001.jpg | 88.659175 | 67.456202 | 60.424978 |
| 1 | frames\frame_0002.jpg | 88.697865 | 67.453529 | 60.475660 |
| 2 | frames\frame_0003.jpg | 88.028351 | 66.913845 | 60.064592 |
| 3 | frames\frame_0004.jpg | 88.825629 | 67.340347 | 60.491645 |
| 4 | frames\frame_0005.jpg | 88.211714 | 66.979661 | 59.983173 |
| ... | ... | ... | ... | ... |
| 325 | frames\frame_0326.jpg | 7.470391 | 7.473355 | 7.479188 |
| 326 | frames\frame_0327.jpg | 7.469779 | 7.472743 | 7.478576 |
| 327 | frames\frame_0328.jpg | 7.480234 | 7.481519 | 7.487826 |
| 328 | frames\frame_0329.jpg | 7.480004 | 7.481289 | 7.487595 |


### 🔬 Checkpoint Tests 🔬

In [14]:
## == CHECKPOINT TESTS ==
# - This read-only cell contains a "checkpoint" for this section of the MicroProject and verifies you are on the right track.
# - If this cell results in a celebration message, you PASSED all test cases!
# - If this cell results in any errors, check your previous cells, make changes, and RE-RUN your code and then this cell.
tada = "\N{PARTY POPPER}"

import math
assert("df" in vars())
assert(len(df) == 330)
assert("r" in df)
assert("g" in df)
assert("b" in df)
assert("frame" in df)
assert( abs( df[ df.frame.str.endswith("_0001.jpg") ]["r"].sum() - 88 ) < 1 )

print(f"{tada} All Tests Passed! {tada}")

🎉 All Tests Passed! 🎉


<hr style="color: #DD3403;">

## Part 2: Create a Simple Classifier

In the DISCOVERY lecture videos, there are two primary "scenes" in the video:

1. **"Office Hours Studio Scene"**, where Karle and Wade are talking to each other and the audience,

2. **"Notebook Scene"**, where the notebook is displayed

View the `frames` folder on your computer and find **at least three more frames** that are in the "office hours studio scene" and **at least three more frames** that are in the "notebook scene".  Add the frames you found to the list below:

In [16]:
# List of at least four office hour frames by the filename's frame number:
office_hour_frames = [1, 4, 166, 171]

# List of at least four notebook frames by the filename's frame number:
notebook_frames = [30, 191, 185, 187]

### Observing the Average Colors of Your Frames

The following code displays the average color values for your selected sample frames.  This information about the average color of the two different types of frames will be useful when you build the classifier in the next section.

You may want to add more frames into your list above to get more data to help build your classifier.  Run the following code to see the average color values:

In [17]:
import os

print("== Office Hour Frames ==")
print( df[ df["frame"].isin( [os.path.join("frames", f"frame_{frame:04d}.jpg") for frame in office_hour_frames]) ] )
print()
print("== Notebook Frames ==")
print( df[ df["frame"].isin( [os.path.join("frames", f"frame_{frame:04d}.jpg") for frame in notebook_frames]) ] )

== Office Hour Frames ==
                     frame          r          g          b
0    frames\frame_0001.jpg  88.659175  67.456202  60.424978
3    frames\frame_0004.jpg  88.825629  67.340347  60.491645
165  frames\frame_0166.jpg  89.799861  69.913568  61.715846
170  frames\frame_0171.jpg  90.530456  70.301437  62.171632

== Notebook Frames ==
                     frame           r           g           b
29   frames\frame_0030.jpg  237.225595  236.513451  236.777122
184  frames\frame_0185.jpg  233.223641  232.456962  230.245742
186  frames\frame_0187.jpg  233.127865  232.372739  230.165365
190  frames\frame_0191.jpg  233.117088  232.354644  230.103359
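
The lookup in the cell above works by turning each frame number into the filename stored in the `frame` column.  Here is a minimal sketch of that step on its own; the helper name `frame_filename` is made up for this illustration:

```python
import os

def frame_filename(frame_number, folder="frames"):
    # `:04d` pads the number with leading zeros to four digits (30 -> "0030").
    # os.path.join uses the separator of the current operating system, which is
    # why the outputs in this notebook show "frames\..." on Windows.
    return os.path.join(folder, f"frame_{frame_number:04d}.jpg")

print(os.path.basename(frame_filename(30)))    # frame_0030.jpg
print(os.path.basename(frame_filename(166)))   # frame_0166.jpg
```

Because the filenames are built with the same `os.path.join` call used when loading the frames, the strings match the `frame` column exactly and `isin` can find them.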


### Create Your Classifier Function

A **classifier function** is a function that takes data and gives a classification for that data.  Create a new function, `classifyFrame` that receives an `r`, `g`, and `b` value.

Using information from your frames above, have the function return the string `"office hour"` or `"notebook"` based on the values of `r`, `g`, and `b`.

**IMPORTANT**: Make sure your classifier can handle **ANY** input -- even frames you have not seen before!  For example, you might decide that you will call a frame an `"office hour"` frame if the sum of `r`, `g` and `b` is greater than 100 and otherwise it's a `"notebook"` scene.

In [18]:
def classifyFrame(r, g, b):
  # Return either "office hour" or "notebook" based on the values of `r`, `g`, and `b`.
  if (r + g + b > 600):
    return "notebook"
  else:
    return "office hour"
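
One possible way to pick a cutoff, rather than guessing, is to place it halfway between the brightest office hour sample and the darkest notebook sample.  The sketch below is only an illustration under that assumption: the sample values are copied from the output shown earlier, and the names `threshold` and `classify_by_brightness` are made up here.

```python
# Average (r, g, b) values copied from the sample frames printed earlier.
office_hour_samples = [
    (88.659175, 67.456202, 60.424978),
    (88.825629, 67.340347, 60.491645),
    (89.799861, 69.913568, 61.715846),
    (90.530456, 70.301437, 62.171632),
]
notebook_samples = [
    (237.225595, 236.513451, 236.777122),
    (233.223641, 232.456962, 230.245742),
    (233.127865, 232.372739, 230.165365),
    (233.117088, 232.354644, 230.103359),
]

# Overall brightness of each sample: r + g + b.
office_sums = [sum(color) for color in office_hour_samples]
notebook_sums = [sum(color) for color in notebook_samples]

# Midpoint between the brightest office hour sample and the darkest notebook sample.
threshold = (max(office_sums) + min(notebook_sums)) / 2

def classify_by_brightness(r, g, b, cutoff=threshold):
    # Bright frames (mostly white notebook background) land above the cutoff.
    return "notebook" if r + g + b > cutoff else "office hour"

print(classify_by_brightness(88.151432, 68.003542, 60.601997))     # office hour
print(classify_by_brightness(243.548924, 242.609227, 241.175135))  # notebook
```

A cutoff chosen this way leaves a wide margin on both sides for these samples, but it says nothing about frames that belong to neither scene.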

### 🔬 Checkpoint Tests 🔬

In [19]:
## == CHECKPOINT TESTS ==
# - This read-only cell contains a "checkpoint" for this section of the MicroProject and verifies you are on the right track.
# - If this cell results in a celebration message, you PASSED all test cases!
# - If this cell results in any errors, check your previous cells, make changes, and RE-RUN your code and then this cell.
tada = "\N{PARTY POPPER}"

r = classifyFrame(0, 0, 0)
assert(r == "notebook" or r == "office hour")

r = classifyFrame(255, 255, 255)
assert(r == "notebook" or r == "office hour")

r = classifyFrame(0, 255, 255)
assert(r == "notebook" or r == "office hour")

r = classifyFrame(255, 255, 0)
assert(r == "notebook" or r == "office hour")

print(f"{tada} All Tests Passed! {tada}")

🎉 All Tests Passed! 🎉


<hr style="color: #DD3403;">

## Part 3: Using Your Classifier!

Now that we have a classifier, we should run it on every frame!

The following cell runs your `classifyFrame` classifier on every frame, adds a new column `scene`, and displays 20 random rows:

In [20]:
df["scene"] = df.apply(lambda row: classifyFrame(row.r, row.g, row.b), axis=1)
df.sample(20)

|     | frame | r | g | b | scene |
| --- | --- | --- | --- | --- | --- |
| 1 | frames\frame_0002.jpg | 88.697865 | 67.453529 | 60.47566 | office hour |
| 186 | frames\frame_0187.jpg | 233.127865 | 232.372739 | 230.165365 | notebook |
| 319 | frames\frame_0320.jpg | 221.227565 | 71.838433 | 54.457305 | office hour |
| 109 | frames\frame_0110.jpg | 230.717266 | 230.119089 | 227.517656 | notebook |
| 140 | frames\frame_0141.jpg | 230.853845 | 229.936962 | 227.710512 | notebook |
| 272 | frames\frame_0273.jpg | 243.548924 | 242.609227 | 241.175135 | notebook |
| 115 | frames\frame_0116.jpg | 88.151432 | 68.003542 | 60.601997 | office hour |
| 114 | frames\frame_0115.jpg | 87.702435 | 67.75362 | 60.350282 | office hour |
| 248 | frames\frame_0249.jpg | 244.143333 | 243.400534 | 241.659718 | notebook |
| 139 | frames\frame_0140.jpg | 87.663203 | 67.649766 | 60.055399 | office hour |


### 🔬 Checkpoint Tests 🔬

In [21]:
## == CHECKPOINT TESTS ==
# - This read-only cell contains a "checkpoint" for this section of the MicroProject and verifies you are on the right track.
# - If this cell results in a celebration message, you PASSED all test cases!
# - If this cell results in any errors, check your previous cells, make changes, and RE-RUN your code and then this cell.
tada = "\N{PARTY POPPER}"

assert("scene" in df)

assert(len(df[ df.scene == "notebook" ]) > 100), "There are more than 100 frames that are clearly the notebook.  Make sure your classifier is able to pick up the notebook scene accurately."
assert(len(df[ df.scene == "office hour" ]) > 75), "There are more than 75 frames that are clearly the office hour set.  Make sure your classifier is able to pick up the office hour set scene accurately."
assert(len(df[ df.scene == "notebook" ]) + len(df[ df.scene == "office hour" ]) == len(df)), "Your classifier must always identify a scene as either a notebook or office hour.  Make sure your classifier always returns one of those two values."

assert( len( df[ (df.frame.str.endswith("0001.jpg")) & (df.scene == "office hour") ] ) == 1 )
assert( len( df[ (df.frame.str.endswith("0306.jpg")) & (df.scene == "office hour") ] ) == 1 )
assert( len( df[ (df.frame.str.endswith("0081.jpg")) & (df.scene == "notebook") ] ) == 1 )
assert( len( df[ (df.frame.str.endswith("0191.jpg")) & (df.scene == "notebook") ] ) == 1 )

print(f"{tada} All Tests Passed! {tada}")

🎉 All Tests Passed! 🎉


## Observing Results

In the next 6 cells, we display a frame and you'll run code to check how your classifier classified it!  Make sure to run the code for each frame:

### Frame #0001: Office Hours

In [22]:
df[ df.frame.str.endswith("0001.jpg") ]

|     | frame | r | g | b | scene |
| --- | --- | --- | --- | --- | --- |
| 0 | frames\frame_0001.jpg | 88.659175 | 67.456202 | 60.424978 | office hour |


![Frame 0001](frames/frame_0001.jpg)

### Frame #0081: Notebook

In [23]:
df[ df.frame.str.endswith("0081.jpg") ]

|     | frame | r | g | b | scene |
| --- | --- | --- | --- | --- | --- |
| 80 | frames\frame_0081.jpg | 230.721385 | 229.915091 | 230.48303 | notebook |


![Frame 0081](frames/frame_0081.jpg)

### Frame #0191: Notebook

In [24]:
df[ df.frame.str.endswith("0191.jpg") ]

|     | frame | r | g | b | scene |
| --- | --- | --- | --- | --- | --- |
| 190 | frames\frame_0191.jpg | 233.117088 | 232.354644 | 230.103359 | notebook |


![Frame 0191](frames/frame_0191.jpg)

### Frame #0306: Office Hours

In [25]:
df[ df.frame.str.endswith("0306.jpg") ]

|     | frame | r | g | b | scene |
| --- | --- | --- | --- | --- | --- |
| 305 | frames\frame_0306.jpg | 89.403867 | 70.149223 | 62.83901 | office hour |


![Frame 0306](frames/frame_0306.jpg)

### Frame #0320: Data Science Duo Logo???

What did you classify the DUO logo as?  It's neither one, but we don't have that option!

In [26]:
df[ df.frame.str.endswith("0320.jpg") ]

|     | frame | r | g | b | scene |
| --- | --- | --- | --- | --- | --- |
| 319 | frames\frame_0320.jpg | 221.227565 | 71.838433 | 54.457305 | office hour |


![Frame 0320](frames/frame_0320.jpg)

### Frame #0328: Video Credits

What did you classify the video credits as?  It's another tricky one!


In [27]:
df[ df.frame.str.endswith("0328.jpg") ]

|     | frame | r | g | b | scene |
| --- | --- | --- | --- | --- | --- |
| 327 | frames\frame_0328.jpg | 7.480234 | 7.481519 | 7.487826 | office hour |


![Frame 0328](frames/frame_0328.jpg)

<hr style="color: #DD3403;">

## Part 4: Update Your Classifier to Account for an "Other" Category

Create a second classifier -- `classifyFrame2` -- that returns either `"notebook"`, `"office hour"` or `"other"`.  Your classifier should correctly handle the "Data Science Duo" (ex: #0320) frames and the "Credit" frames (ex: #0328).

In [28]:
def classifyFrame2(r, g, b):
  # Return either "office hour", "notebook", or "other" based on the values of `r`, `g`, and `b`.
  if (r + g + b > 600):
    return "notebook"
  elif (r > 200 or r < 20):
    return "other"
  else:
    return "office hour"
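
A different way to handle three categories, shown here only as a sketch, is to label a frame with the category of the closest reference color instead of using hand-picked thresholds.  This nearest-reference approach is not the method used in the cell above; the reference colors are copied from outputs earlier in this notebook, the names `REFERENCE_COLORS` and `classify_nearest` are made up, and it has not been checked against every frame or against the checkpoint tests.

```python
import math

# Reference colors copied from frame averages printed earlier in this notebook.
# Each entry is (label, (r, g, b)); "other" has two references (logo and credits).
REFERENCE_COLORS = [
    ("office hour", (88.659175, 67.456202, 60.424978)),     # frame_0001.jpg
    ("notebook",    (233.117088, 232.354644, 230.103359)),  # frame_0191.jpg
    ("other",       (221.227565, 71.838433, 54.457305)),    # frame_0320.jpg (logo)
    ("other",       (7.480234, 7.481519, 7.487826)),        # frame_0328.jpg (credits)
]

def classify_nearest(r, g, b):
    """Return the label of the reference color closest to (r, g, b)."""
    # math.dist (Python 3.8+) is the straight-line distance between two points.
    label, _ = min(REFERENCE_COLORS, key=lambda ref: math.dist(ref[1], (r, g, b)))
    return label

print(classify_nearest(89.403867, 70.149223, 62.83901))    # frame_0306.jpg -> office hour
print(classify_nearest(230.721385, 229.915091, 230.48303)) # frame_0081.jpg -> notebook
print(classify_nearest(227.706259, 67.024722, 48.608728))  # frame_0316.jpg -> other
print(classify_nearest(7.470391, 7.473355, 7.479188))      # frame_0326.jpg -> other
```

The trade-off is that every new kind of frame needs its own reference color, while a threshold rule can cover a whole range of colors at once.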

## Apply your `classifyFrame2` function

This code uses `classifyFrame2` to replace the values in the `scene` column with your new classifications.  The output of this cell shows the last frames of the video, which we expect to be `"other"`:

In [29]:
df["scene"] = df.apply(lambda row: classifyFrame2(row.r, row.g, row.b), axis=1)
df.tail(20)

|     | frame | r | g | b | scene |
| --- | --- | --- | --- | --- | --- |
| 310 | frames\frame_0311.jpg | 89.052721 | 70.084679 | 62.528559 | office hour |
| 311 | frames\frame_0312.jpg | 89.577539 | 70.261745 | 62.89487 | office hour |
| 312 | frames\frame_0313.jpg | 89.365169 | 70.192526 | 62.526467 | office hour |
| 313 | frames\frame_0314.jpg | 89.24036 | 70.183095 | 62.777127 | office hour |
| 314 | frames\frame_0315.jpg | 89.053277 | 70.11523 | 62.639093 | office hour |
| 315 | frames\frame_0316.jpg | 227.706259 | 67.024722 | 48.608728 | other |
| 316 | frames\frame_0317.jpg | 233.53987 | 66.995486 | 47.76674 | other |
| 317 | frames\frame_0318.jpg | 227.335317 | 67.101398 | 48.556484 | other |
| 318 | frames\frame_0319.jpg | 221.847739 | 72.007214 | 54.58467 | other |
| 319 | frames\frame_0320.jpg | 221.227565 | 71.838433 | 54.457305 | other |


### 🔬 Checkpoint Tests 🔬

In [30]:
## == CHECKPOINT TESTS ==
# - This read-only cell contains a "checkpoint" for this section of the MicroProject and verifies you are on the right track.
# - If this cell results in a celebration message, you PASSED all test cases!
# - If this cell results in any errors, check your previous cells, make changes, and RE-RUN your code and then this cell.
tada = "\N{PARTY POPPER}"

assert("scene" in df)

assert(len(df[ df.scene == "notebook" ]) > 100)
assert(len(df[ df.scene == "office hour" ]) > 75)
assert(len(df[ df.scene == "other" ]) >= 15)
assert(len(df[ df.scene == "other" ]) <= 18)   # It's okay to classify the intro screens as "other" as well -- but not any others.
assert(len(df[ df.scene == "notebook" ]) + len(df[ df.scene == "office hour" ]) + len(df[ df.scene == "other" ]) == len(df))

assert( len( df[ (df.frame.str.endswith("0001.jpg")) & (df.scene == "office hour") ] ) == 1 )
assert( len( df[ (df.frame.str.endswith("0306.jpg")) & (df.scene == "office hour") ] ) == 1 )
assert( len( df[ (df.frame.str.endswith("0081.jpg")) & (df.scene == "notebook") ] ) == 1 )
assert( len( df[ (df.frame.str.endswith("0191.jpg")) & (df.scene == "notebook") ] ) == 1 )
assert( len( df[ (df.frame.str.endswith("0317.jpg")) & (df.scene == "other") ] ) == 1 )
assert( len( df[ (df.frame.str.endswith("0325.jpg")) & (df.scene == "other") ] ) == 1 )
assert( len( df[ (df.frame.str.endswith("0328.jpg")) & (df.scene == "other") ] ) == 1 )

print(f"{tada} All Tests Passed! {tada}")

🎉 All Tests Passed! 🎉
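
Since the frames were sampled once per second, counting the labels gives an estimate of the seconds spent in each scene, which was the goal stated at the start of this MicroProject.  Here is a minimal sketch using a short made-up list of labels (the list `scenes` is hypothetical and stands in for the `scene` column):

```python
from collections import Counter

# Hypothetical labels for a 10-second clip, one label per sampled frame.
scenes = ["office hour"] * 4 + ["notebook"] * 5 + ["other"] * 1

# One frame per second, so each count is an estimate in seconds.
seconds_per_scene = Counter(scenes)
total_seconds = len(scenes)

for scene, seconds in seconds_per_scene.most_common():
    print(f"{scene}: {seconds} s ({seconds / total_seconds:.0%})")
# notebook: 5 s (50%)
# office hour: 4 s (40%)
# other: 1 s (10%)
```

On the real DataFrame, `df["scene"].value_counts()` should give the same kind of tally directly.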


<hr style="color: #DD3403;">

## Submission

You're almost done!  All you need to do is to commit your lab to GitHub and run the GitHub Actions Grader:

1.  ⚠️ **Make certain to save your work.** ⚠️ To do this, go to **File => Save All**

2.  After you have saved, exit this notebook and follow the instructions to commit and grade this MicroProject!