<h1 style="text-align: center;">
<div style="color: #DD3403; font-size: 60%">Data Science DISCOVERY Project #1</div>
<span style="">Project #1: Mosaic Project</span>
<div style="font-size: 60%;"><a href="https://discovery.cs.illinois.edu/guides/project-mosaic/">https://discovery.cs.illinois.edu/guides/project-mosaic/</a></div>
</h1>

<hr style="color: #DD3403;">

# Section 1: Understanding an Image

In both lecture and `lab_favorites`, you learned that a computer represents an image as a 2D grid of **pixels**, where each pixel is a single color.  Here is our sample image, **zoomed in 50x**, showing its nine pixels:

![Color](./notebook-images/sample-50x.png)


## Section 1.1: The `DISCOVERY` library!

In `lab_favorites`, you learned about the DISCOVERY library to help you with image data.  To use it:

- You must `import DISCOVERY` (all caps, because DISCOVERY!)
- Use `image = DISCOVERY.df_image("sample.png")` to load the `sample.png` image as a DataFrame

Try that out in the cell below:

In [13]:
import DISCOVERY
image = DISCOVERY.df_image("sample.png")
image

Unnamed: 0,x,y,r,g,b
0,0,0,0,0,255
1,0,1,0,255,0
2,0,2,255,255,255
3,1,0,255,255,255
4,1,1,0,0,0
5,1,2,255,85,46
6,2,0,255,0,0
7,2,1,255,255,255
8,2,2,19,41,75


In the cell below, select the row with the GREEN 🟩 pixel (second row, first column) and store it in the variable `green_pixel`:

In [11]:
green_pixel = image.iloc[1:2]
green_pixel

Unnamed: 0,x,y,r,g,b
1,0,1,0,255,0
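
Selecting by position works here because the green pixel happens to be the second row of the DataFrame.  As a minimal sketch of an alternative, the cell below selects the same pixel by its coordinates instead.  The DataFrame is rebuilt by hand from the table above (so this runs without `sample.png`), and the names `by_position` and `by_coordinate` are just for illustration:

```python
import pandas as pd

# Hand-built copy of the 3x3 sample image table shown above.
image = pd.DataFrame({
    "x": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "y": [0, 1, 2, 0, 1, 2, 0, 1, 2],
    "r": [0, 0, 255, 255, 0, 255, 255, 255, 19],
    "g": [0, 255, 255, 255, 0, 85, 0, 255, 41],
    "b": [255, 0, 255, 255, 0, 46, 0, 255, 75],
})

# Select by position: the slice 1:2 keeps the result a one-row DataFrame.
by_position = image.iloc[1:2]

# Select by coordinate: first column (x == 0), second row (y == 1).
by_coordinate = image[(image.x == 0) & (image.y == 1)]

print(by_coordinate)
```

Both selections return the same single row.  Selecting by coordinate keeps working even if the rows are stored in a different order.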


In [12]:
# == TEST CASE for Section 1 ==
import DISCOVERY
DISCOVERY.run_test_case_1b(green_pixel)

✅ `green_pixel` contains just one pixel!
✅ `green_pixel` is a green pixel!
🎉 All tests passed! 🎉


<hr style="color: #DD3403;">

# Section 2: Accessing Color Data

Every color visible on a computer screen is made up of the three **primary colors of light** -- red, green, and blue.  Your monitor displays color by varying the intensity of the red light, green light, and blue light emitted for every pixel on your screen.  Since images are primarily displayed on computer screens, the **default "color space" is to represent colors as red, green, and blue**.

When you have a pixel, you will always see the three components.  For example, the color stored in `green_pixel` is `[0, 255, 0]`:
- The first `0` means we have 0 / 255 (0%) **red** light
- The second `255` means we have 255 / 255 (100%) **green** light
- The final `0` means we have 0 / 255 (0%) **blue** light
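
To make the "out of 255" idea concrete, here is a small sketch that converts each component to a percentage of full intensity.  The helper `rgb_to_percent` is a hypothetical name made up for this example and is not part of the DISCOVERY library:

```python
def rgb_to_percent(r, g, b):
    """Convert 0-255 channel values to percentages of full intensity."""
    return tuple(round(channel / 255 * 100, 1) for channel in (r, g, b))

# The green pixel: no red, full green, no blue.
print(rgb_to_percent(0, 255, 0))

# The (255, 85, 46) pixel from the table in Section 1.
print(rgb_to_percent(255, 85, 46))
```

The second pixel works out to 100% red, about 33.3% green, and about 18.0% blue.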

## Section 2.1: Accessing Illini Orange

Using the `image` you loaded in Section 1, load the Illini Orange colored pixel into `illini_orange_pixel`:

In [16]:
illini_orange_pixel = image.iloc[5:6]
illini_orange_pixel

Unnamed: 0,x,y,r,g,b
5,1,2,255,85,46


Now we can access the red, green, and blue components by their column names.  Once you have a pixel:

- `illini_orange_pixel.r`/`illini_orange_pixel["r"]` is the **column** that contains the red component of the pixel.

Your `illini_orange_pixel` DataFrame only has one row, so we need to convert that **column** into just a value.  There are several ways to do this and you can choose any one of the following:

- One method is to **sum all the values**: `illini_orange_pixel.r.sum()`, will sum up all the `r` values; since there's only one row in the DataFrame, the sum is the value you want.
- Another method is to **access the list of values and pick the index 0 value**: `illini_orange_pixel.r.values[0]` will get all the values for `r` and choose the first one (index `0`).
- ...and many other ways...
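
As a quick sketch, the cell below compares a few of these approaches on a hand-built one-row DataFrame.  The variable `pixel` stands in for `illini_orange_pixel`; it is built by hand only so the example runs on its own:

```python
import pandas as pd

# A one-row DataFrame standing in for `illini_orange_pixel` (row label 5).
pixel = pd.DataFrame(
    {"x": [1], "y": [2], "r": [255], "g": [85], "b": [46]},
    index=[5],
)

red_sum = pixel.r.sum()         # the sum of a single value is that value
red_values = pixel.r.values[0]  # first element of the underlying array
red_iloc = pixel.r.iloc[0]      # first element by position
red_item = pixel.r.item()       # only works when there is exactly one value

print(red_sum, red_values, red_iloc, red_item)
```

All four give `255`.  Note that these use the *position* of the value, not the row label: the row label here is `5`, so there is no row labeled `0` to look up.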

Using any method you want, find the red, green, and blue values of the `illini_orange_pixel`:

In [17]:
red = illini_orange_pixel.r.sum()
red

255

In [18]:
green = illini_orange_pixel.g.sum()
green

85

In [19]:
blue = illini_orange_pixel.b.sum()
blue

46

In [20]:
# == TEST CASE for Section 2 ==
DISCOVERY.run_test_case_2(red, green, blue)

✅ `red` is a number!
✅ `red` has the correct value!
✅ `green` is a number!
✅ `green` has the correct value!
✅ `blue` is a number!
✅ `blue` has the correct value!
🎉 All tests passed! 🎉


<hr style="color: #DD3403;">

# Section 3: Finding the Average Color of the Sample Image

Now, find the average color of the **entire** `image` by finding `avg_r`, `avg_g`, and `avg_b`.

*(You've already done this in `lab_favorites`, refer back to your lab if needed.)*

In [21]:
avg_r = image.r.mean()
avg_r

143.77777777777777

In [22]:
avg_g = image.g.mean()
avg_g

127.33333333333333

In [23]:
avg_b = image.b.mean()
avg_b

126.77777777777777
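
As a sanity check, the average red value can be worked out by hand from the nine `r` values in the Section 1 table.  This sketch uses plain Python lists, and the names `r_values` and `avg_r_by_hand` are just for illustration:

```python
# The nine `r` values, copied from the table in Section 1.
r_values = [0, 0, 255, 255, 0, 255, 255, 255, 19]

# Mean = sum of the values divided by the number of values.
total = sum(r_values)
count = len(r_values)
avg_r_by_hand = total / count

print(total, count, avg_r_by_hand)
```

The sum is 1294 over 9 pixels, which matches the `avg_r` found with `.mean()` above.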

<hr style="color: #DD3403;">

# Section 4: Find the Average Color of ANY Image

Building off Section 3, we now want to find the average color of **ANY** image.  To do that, we need to create a function that is given a DataFrame `image` and returns the average color of that image.

Write this in the function `findAverageImageColor` below.  You must return the **average color as a dictionary** with the values:
- `avg_r`, for the average red color,
- `avg_g`, for the average green color,
- `avg_b`, for the average blue color

A **dictionary** is a data structure that stores multiple values, each under its own named key.  You have used a dictionary in "Step 4" of your simulation code, where you use `d` to accumulate all real-world values.  Accumulate the values `avg_r`, `avg_g`, and `avg_b` just like you would in the simulation:

---
```py
def findAverageImageColor(image):
  ...

  # Return a dictionary of average color:
  d = {"avg_r" : avg_r, "avg_g": avg_g, "avg_b": avg_b}
  return d
```
---

Write the entire `findAverageImageColor` function to find the average color of the `image` passed into the function:

In [27]:
def findAverageImageColor(image):
    avg_r = image.r.mean()
    avg_g = image.g.mean()
    avg_b = image.b.mean()
    d = {"avg_r":avg_r,"avg_g":avg_g,"avg_b":avg_b}
    return d

findAverageImageColor(image)

{'avg_r': 143.77777777777777,
 'avg_g': 127.33333333333333,
 'avg_b': 126.77777777777777}

In [28]:
# == TEST CASE for Section 4 ==
DISCOVERY.run_test_case_4(findAverageImageColor)

✅ Dictionary contain the key `avg_r`.
✅ Dictionary contain the key `avg_g`.
✅ Dictionary contain the key `avg_b`.
✅ The values all appear correct!
🎉 All tests passed! 🎉


<hr style="color: #DD3403;">

# Section 5: Splitting Up Your Base Image

To create a mosaic from an image, we must split the base image into small regions to be replaced with the tile images. To accomplish this, we need a function that will **find the subset of pixels found in a region of an image**.

- Thinking about the 3x3 pixel image `sample.png` (from Section 1), we might need a 2x2 square (or 1x3 rectangle) of pixels instead of using all 3x3 pixels.


### Your `findImageSubset` function

Create a function `findImageSubset` that finds the subset of the image starting at (`x`, `y`), spanning `width` pixels wide and `height` pixels tall. Your function should return the **subset of all the pixels in that region of the image**.

- Example: `findImageSubset(image, x=0, y=0, width=3, height=3)` -- returns subset of all the pixels in the square defined by: x=0...2 and y=0...2 (9 total pixels)

- Example: `findImageSubset(image, x=5, y=5, width=5, height=5)` -- returns subset of all the pixels in the square defined by: x=5...9 and y=5...9 (25 total pixels)

- Example: `findImageSubset(image, x=5, y=0, width=5, height=5)` -- returns subset of all the pixels in the square defined by: x=5...9 and y=0...4 (25 total pixels)
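
The ranges in these examples follow one rule: a region that starts at `x` and is `width` pixels wide covers `x` through `x + width - 1`.  Here is a minimal sketch that checks this on a hypothetical 10x10 grid of coordinates.  The helper `region` and the `coords` DataFrame are made up for this check and contain no color data:

```python
import pandas as pd

# Hypothetical 10x10 image: only the coordinates matter for this check.
coords = pd.DataFrame(
    [(x, y) for x in range(10) for y in range(10)],
    columns=["x", "y"],
)

def region(df, x, y, width, height):
    # `between` is inclusive on both ends, so the last pixel is x + width - 1.
    in_x = df.x.between(x, x + width - 1)
    in_y = df.y.between(y, y + height - 1)
    return df[in_x & in_y]

subset = region(coords, x=5, y=0, width=5, height=5)
print(len(subset), subset.x.min(), subset.x.max(), subset.y.min(), subset.y.max())

# A region that runs past the edge simply contains fewer pixels.
edge = region(coords, x=8, y=8, width=5, height=5)
print(len(edge))
```

The first region has 25 pixels spanning x=5...9 and y=0...4.  The second has only the 4 pixels that actually exist in the image.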

In [31]:
def findImageSubset(image, x, y, width, height):
    # Keep pixels with x in [x, x + width) and y in [y, y + height):
    subset_df = image[(image['x'] >= x) & (image['x'] < x + width) &
                      (image['y'] >= y) & (image['y'] < y + height)]
    return subset_df


In [32]:
# == TEST CASE for Section 5 ==
DISCOVERY.run_test_case_5(findImageSubset)

✅ Test case for findImageSubset(image, x=0, y=0, width=2, height=2) appears correct.
✅ Test case for findImageSubset(image, x=2, y=0, width=2, height=2) appears correct.
✅ Test case for findImageSubset(image, x=2, y=2, width=2, height=2) appears correct.
✅ Test case for findImageSubset(image, x=5, y=1, width=2, height=2) appears correct.
✅ Test case for findImageSubset(image, x=5, y=1, width=3, height=2) appears correct.
✅ Test case for findImageSubset(image, x=5, y=1, width=4, height=3) appears correct.
✅ Test case for findImageSubset(image, x=1, y=1, width=1, height=3) appears correct.
🎉 All tests passed! 🎉


<hr style="color: #DD3403;">

# Section 6: Finding the Average Color of a Subset

You have created two functions:

1. A function that finds the average color of a DataFrame of pixels (`findAverageImageColor`), **AND**
2. A function that finds a subset of pixels of an image (`findImageSubset`)

Create a new function, `findAverageImageSubsetColor` that combines both of them and returns the average color of a subset of the image:


In [46]:
def findAverageImageSubsetColor(image, x, y, width, height):
  # Find the subset:
  subset = findImageSubset(image,x,y,width,height)

  # Find and return the average color of the subset:
  return findAverageImageColor(subset)

In [47]:
# == TEST CASE for Section 6 ==
DISCOVERY.run_test_case_6(findAverageImageSubsetColor)

✅ Test case for findAverageImageSubsetColor(image, x=0, y=0, width=2, height=2) appears correct.
✅ Test case for findAverageImageSubsetColor(image, x=2, y=0, width=2, height=2) appears correct.
✅ Test case for findAverageImageSubsetColor(image, x=2, y=2, width=2, height=2) appears correct.
✅ Test case for findAverageImageSubsetColor(image, x=5, y=1, width=2, height=2) appears correct.
✅ Test case for findAverageImageSubsetColor(image, x=5, y=1, width=3, height=2) appears correct.
✅ Test case for findAverageImageSubsetColor(image, x=5, y=1, width=4, height=3) appears correct.
✅ Test case for findAverageImageSubsetColor(image, x=1, y=1, width=1, height=3) appears correct.
🎉 All tests passed! 🎉
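
To see the combined function on real numbers, this sketch runs it on the top-left 2x2 square of the 3x3 sample image.  The DataFrame is rebuilt by hand from the Section 1 table so the cell runs on its own, and the three functions are compact restatements of the ones above:

```python
import pandas as pd

# Hand-built copy of the 3x3 `sample.png` table from Section 1.
image = pd.DataFrame({
    "x": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "y": [0, 1, 2, 0, 1, 2, 0, 1, 2],
    "r": [0, 0, 255, 255, 0, 255, 255, 255, 19],
    "g": [0, 255, 255, 255, 0, 85, 0, 255, 41],
    "b": [255, 0, 255, 255, 0, 46, 0, 255, 75],
})

def findAverageImageColor(image):
    return {"avg_r": image.r.mean(), "avg_g": image.g.mean(), "avg_b": image.b.mean()}

def findImageSubset(image, x, y, width, height):
    return image[(image.x >= x) & (image.x < x + width) &
                 (image.y >= y) & (image.y < y + height)]

def findAverageImageSubsetColor(image, x, y, width, height):
    return findAverageImageColor(findImageSubset(image, x, y, width, height))

# Top-left 2x2 square: the blue, green, white, and black pixels.
top_left = findAverageImageSubsetColor(image, x=0, y=0, width=2, height=2)
print(top_left)
```

The four pixels average to `avg_r` = 63.75, `avg_g` = 127.5, and `avg_b` = 127.5.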


<hr style="color: #DD3403;">

# Section 7: Finding the Average Color of Your Tile Images

Before beginning the programming part of this project, you should have set up a directory called `tiles` that contains all of your tile images.  **If you haven't done that, you need to do that now.**

To create an image mosaic, we need to find the average pixel color of every one of our tile images so that you can find the BEST tile image to use when you're creating your mosaic.  The code below is already complete and does the following:

- Goes through each image file in your `tiles` directory,
- Finds the average pixel color of each image using your `findAverageImageColor` function from Section 4,
- Finally, creates a new DataFrame `df_tiles` with the average color of each image and returns that DataFrame.

*Make sure to run this code -- we'll use it in the very next section!*

In [48]:
import pandas as pd

def createTilesDataFrame(path):
  data = []

  # Loop through all images in the `path` directory:
  for tileImageFileName in DISCOVERY.listTileImagesInPath(path):
    # Load the image as a DataFrame and find the average color:
    image = DISCOVERY.df_image(tileImageFileName)
    averageColor = findAverageImageColor(image)

    # Store the fileName and average colors in a dictionary:
    d = { "fileName": tileImageFileName, "r": averageColor["avg_r"], "g": averageColor["avg_g"], "b": averageColor["avg_b"] }
    data.append(d)

  # Create the `df_tiles` DataFrame:
  df_tiles = pd.DataFrame(data)
  return df_tiles


## Understand the Output

Here's the DataFrame output that the function creates for some sample images.  The DataFrame has four columns:
- `fileName`, containing the name of the tile image
- `r`, `g`, and `b`, containing the average color for the image

In [49]:
createTilesDataFrame("notebook-images")

Unnamed: 0,fileName,r,g,b
0,notebook-images/test3.png,187.625,67.125,46.5
1,notebook-images/test2.png,232.0,74.0,39.0
2,notebook-images/test.png,192.0625,67.8125,45.75
3,notebook-images/sample-100x.png,144.146311,127.815078,127.247289
4,notebook-images/sample.png,143.777778,127.333333,126.777778
5,notebook-images/sample-50x.png,144.146311,127.812889,127.244


<hr style="color: #DD3403;">

# Section 8: Finding the Best Match

There's just **one last function**.  This function needs to find the best tile for a given average color.

To do this, you will use two pieces of data:

1. You will use the DataFrame of all of your tile images that is generated in the previous section.  This will be passed into your function as `df_tiles`.
2. You will use the average color of a subset of your image.  This is passed into your function as `avg_r`, `avg_g`, and `avg_b`.

Using this data, `findBestTile` must **find the best tile image from `df_tiles` for a given average color**.  You should do this by finding the single row in `df_tiles` that has the smallest distance from the average color.

## Example 

Imagine you just have three tiles, so your `df_tiles` DataFrame is:

| fileName | r | g | b |
| -------- | - | - | - |
| red.jpg | 255 | 0 | 0 |
| green.jpg | 0 | 255 | 0 |
| blue.jpg | 0 | 0 | 255 |


If your image subset has `avg_r` = 10, `avg_g` = 200, and `avg_b` = 20, we can use the distance formula (the Pythagorean theorem) to find how "far away" the average color is from each of the tile images:

1. For the red tile (255, 0, 0), the distance away is $d = \sqrt{(255 - 10)^2 + (0 - 200)^2  + (0 - 20)^2} = 316.8990375$
2. For the green tile (0, 255, 0), the distance away is $d = \sqrt{(0 - 10)^2 + (255 - 200)^2  + (0 - 20)^2} = 59.37171044$
3. For the blue tile (0, 0, 255), the distance away is $d = \sqrt{(0 - 10)^2 + (0 - 200)^2  + (255 - 20)^2} = 308.7474696$

We find that the green tile is the closest since it has the minimum distance.  The green tile should be returned.
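
The arithmetic in this example can be double-checked with a few lines of pandas.  This is only a sketch of the calculation on the three example tiles, not the required `findBestTile` function, and the names `distance` and `best` are made up for the check:

```python
import pandas as pd

# The three example tiles from the table above.
df_tiles = pd.DataFrame({
    "fileName": ["red.jpg", "green.jpg", "blue.jpg"],
    "r": [255, 0, 0],
    "g": [0, 255, 0],
    "b": [0, 0, 255],
})
avg_r, avg_g, avg_b = 10, 200, 20

# Distance formula applied to every row at once.
distance = ((df_tiles.r - avg_r) ** 2
            + (df_tiles.g - avg_g) ** 2
            + (df_tiles.b - avg_b) ** 2) ** 0.5

# `idxmin` gives the row label of the smallest distance.
best = df_tiles.loc[distance.idxmin(), "fileName"]
print(distance.tolist(), best)
```

The three distances agree with the values worked out above, and the closest tile is `green.jpg`.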


## Hints

- It will be **VERY** helpful to add an extra column to your DataFrame -- like `df["distance"] = ...`.
- Once you have the distance calculated, how do you return just the row with the smallest distance?

In [56]:
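def findBestTile(df_tiles, r_avg, g_avg, b_avg):
  # Distance from each tile's average color to the target color
  # (note: this adds a `distance` column to the `df_tiles` passed in):
  df_tiles['distance'] = ((df_tiles['r'] - r_avg)**2 + (df_tiles['g'] - g_avg)**2 + (df_tiles['b'] - b_avg)**2) ** 0.5
  # Keep only the single row with the smallest distance:
  closest = df_tiles.nsmallest(1, "distance")
  return closest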


  

In [57]:
# == TEST CASE for Section 8 ==
DISCOVERY.run_test_case_8(findBestTile)

✅ Test case #1 (r=0, g=0, b=0) passed!
✅ Test case #1 (r=47, g=49, b=38) passed!
✅ Test case #1 (r=54, g=49, b=38) passed!
✅ Test case #1 (r=54, g=49, b=52) passed!
✅ Test case #1 (r=-100, g=-100, b=-100) passed!
🎉 All tests passed! 🎉


<hr style="color: #DD3403;">

# Section 9: Your Mosaic!

Time to put everything together!

First, let's define some variables that you can configure to make your mosaic uniquely yours:

In [58]:
# What is your base image file name?
baseImageFileName = "wishyouwerehere.jpg"

# What folder contains your tile images?
# - You can change this so you can have multiple different folders of tile images.
tileImageFolder = "tilepics"

# What is the maximum number of tiles your mosaic should use across?
# - More tiles across will increase the quality of the final image.
# - More tiles across will cause your program to run slower.
# ...if you have bugs, start with a low value (it won't look great, but it will run fast!)
# ...a value around 200 usually looks quite good, but play around with this number!
maximumTilesX = 1800

# What height should your tiles be in your mosaic?
# - A larger tile image will result in a larger output file.
# - A larger tile image will result in your program running slower.
# - A larger tile image will result in more detail in the output file.
tileHeight = 32


## Now create your mosaic!

Run the code to create your mosaic.

- This **WILL SERIOUSLY** take a bit of time (even more time on slower/older laptops).
- This will run fastest if your laptop is plugged in (when it's unplugged, your laptop will try and save power and may not run at full speed).

## Part 1: Generate the `df_tiles` DataFrame from your tile images

⚠️: If you are using **large images** (ex: phone camera images, etc), this **may take many hours to run**.  You may want to shrink them using the notebook we provided:
- We have provided you a notebook `shrink-images.ipynb` that will shrink your tile images to a small size to run faster.
- *(In DISCOVERY, you focused on understanding the algorithm and not using the most efficient data structures; to understand the most efficient data structure, you'll want to take CS 225 to learn about kd-trees.)*

In [59]:
print(f"Creating `df_tiles` from tile images in folder `{tileImageFolder}`...")
df_tiles = createTilesDataFrame(tileImageFolder)
print(f"...found {len(df_tiles)} tile images!")
df_tiles

Creating `df_tiles` from tile images in folder `tilepics`...
...found 1805 tile images!


Unnamed: 0,fileName,r,g,b
0,tilepics/12543290_743782945758728_2050232560_n...,110.011600,104.845222,101.448234
1,tilepics/929215_352643904900366_812094193_n-9-...,152.923130,127.347299,107.686288
2,tilepics/17267774_227185591020106_415187012684...,143.716875,128.833750,127.769531
3,tilepics/11282781_107601682906321_284282954_n.png,128.718281,137.972500,156.938437
4,tilepics/929215_352643904900366_812094193_n-9-...,96.211562,103.488906,122.779062
...,...,...,...,...
1800,tilepics/1515111_471984069660729_1275938831_n.png,115.369219,111.500000,107.671406
1801,tilepics/929215_352643904900366_812094193_n-9-...,175.402500,125.611094,99.783125
1802,tilepics/929215_352643904900366_812094193_n-9-...,111.265432,135.198495,139.974537
1803,tilepics/929215_352643904900366_812094193_n-9-...,147.431250,128.500313,77.434219


## Part 2: Loading your `baseImage`

⚠️: If you are using **an extremely large image**, this may take a while to run (particularly on older laptops or if you're not plugged in).  You may want to resize your base image if this is taking a long time to run.  An image ~1000px across may take anywhere from 10 seconds to 5 minutes.

In [60]:
print(f"Loading your base image `{baseImageFileName}`...")
baseImage = DISCOVERY.df_image(baseImageFileName)
width = baseImage.x.max()
height = baseImage.y.max()

baseImage

Loading your base image `wishyouwerehere.jpg`...


Unnamed: 0,x,y,r,g,b
0,0,0,168,200,247
1,0,1,157,189,236
2,0,2,154,186,233
3,0,3,163,195,242
4,0,4,168,200,247
...,...,...,...,...,...
477139,563,841,104,133,137
477140,563,842,110,141,143
477141,563,843,104,138,137
477142,563,844,90,131,125


## Part 3: Create a mosaic by finding the best match

⚠️: If you are using a large value for `maximumTilesX` (set at the beginning of this section), this may take a long time.

In [61]:
import sys

print(f"Finding best replacement image for each tile...")
# Find the pixelsPerTile to know the pixels used in the base image per mosaic tile:
import math

pixelsPerTile = int(math.ceil(width / maximumTilesX))
width = int(math.floor(width / pixelsPerTile) * pixelsPerTile)
height = int(math.floor(height / pixelsPerTile) * pixelsPerTile)
tilesX = int(width / pixelsPerTile)
tilesY = int(height / pixelsPerTile)

# Create the mosaic:
from PIL import Image
mosaic = Image.new('RGB', (int(tilesX * tileHeight), int(tilesY * tileHeight)))
for x in range(0, width, pixelsPerTile):
  for y in range(0, height, pixelsPerTile):
    avg_color = findAverageImageSubsetColor(baseImage, x, y, pixelsPerTile, pixelsPerTile)
    replacement = findBestTile(df_tiles, avg_color["avg_r"], avg_color["avg_g"], avg_color["avg_b"])

    tile = DISCOVERY.getTileImage(replacement["fileName"].values[0], tileHeight)
    mosaic.paste(tile, (int(x / pixelsPerTile) * tileHeight, int(y / pixelsPerTile) * tileHeight))

  # Print out a progress message:
  curRow = int((x / pixelsPerTile) + 1)
  pct = (curRow / tilesX) * 100
  sys.stdout.write(f'\r  ...progress: {curRow * tilesY} / {tilesX * tilesY} ({pct:.2f}%)')

# Save it
mosaic.save('mosaic-hd.jpg')

# Save a smaller one (for posting):
import PIL
d = max(width, height)
factor = d / 4000
if factor <= 1: factor = 1

small_w = width / factor
small_h = height / factor    
baseImage = mosaic.resize( (int(small_w), int(small_h)), resample=PIL.Image.LANCZOS )
baseImage.save('mosaic-web.jpg')

# Print a message:
tada = "\N{PARTY POPPER}"
print("")
print("")
print(f"{tada} MOSAIC COMPLETE! {tada}")
print("- See `mosaic-hd.jpg` to see your HQ mosaic! (The file may be HUGE.)")
print("- See `mosaic-web.jpg` to see a mosaic best suited for the web (still big, but not HUGE)!")

Finding best replacement image for each tile...
  ...progress: 475735 / 475735 (100.00%)

🎉 MOSAIC COMPLETE! 🎉
- See `mosaic-hd.jpg` to see your HQ mosaic! (The file may be HUGE.)
- See `mosaic-web.jpg` to see a mosaic best suited for the web (still big, but not HUGE)!
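
The progress total can be traced back to the grid arithmetic at the top of the Part 3 cell.  The sketch below repeats that arithmetic with the values from this run.  The width of 563 comes from `baseImage.x.max()` in the Part 2 output and `maximumTilesX = 1800` from the configuration cell; the height of 845 is an assumption inferred from the progress total, since the last row of the Part 2 table is cut off:

```python
import math

# Values from this run (height is inferred, see the note above).
width, height = 563, 845
maximumTilesX = 1800
tileHeight = 32

# Same arithmetic as the start of the Part 3 cell.
pixelsPerTile = int(math.ceil(width / maximumTilesX))
width = int(math.floor(width / pixelsPerTile) * pixelsPerTile)
height = int(math.floor(height / pixelsPerTile) * pixelsPerTile)
tilesX = int(width / pixelsPerTile)
tilesY = int(height / pixelsPerTile)

print(pixelsPerTile, tilesX, tilesY, tilesX * tilesY)
print(tilesX * tileHeight, tilesY * tileHeight)
```

Because `maximumTilesX` is larger than the base image width, `pixelsPerTile` comes out to 1, so every pixel of the base image becomes its own tile: 563 x 845 = 475735 tiles.  At 32 pixels per tile, the full-size mosaic is 18016 x 27040 pixels.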


<hr style="color: #DD3403;">

# Section 10: Verify Your Mosaic Looks Good!

Your mosaic is uniquely yours -- generated by your code, your base image, and your tile images.  Your mosaic must look like your base image.

If you find your mosaic does not look like your original image, there are several factors that might cause this:

1. You may not have enough tile images.  You generally need **at least 100 tile images of a large range of colors** to start to create a decent mosaic.  If you want a high quality mosaic, use more tile images.

2. You may have set too few tiles across.  The `maximumTilesX` at the beginning of Section 9 controls the size of your grid.  A value of `maximumTilesX = 100` or higher is usually needed to create a decent mosaic.

3. Your functions may be incorrect.  We provided test cases to try and catch common errors, but we may have missed something!

Add more tiles, change your parameters, or double-check your code and continue to work on this until you have a mosaic you're proud of! :)

<hr style="color: #DD3403;">

# Section 11: Extra Credit

So your mosaic is fantastic -- but I think you can make it **even MORE fantastic**!  If you have ideas about how to improve your mosaic, use the following cells to re-program a function or otherwise change your logic and then re-create your mosaic.

If you're not sure and want some inspiration, visit **#project-extra-credit**.  We'll work together on ideas -- after the first week, we will share some of our ideas.  Be sure to look at the **pinned messages** on the channel for suggestions that we find to be really good ways of improving the mosaic!  (We have a few in mind as we write this, but there's probably even more!  You should aim to have fun!)

EC:
For this part, I decided to reverse the process of finding the 'best' tile: instead, I find the tile whose average color deviates the most from the corresponding region of the base image 👇🏻

In [1]:
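def findWorstTile(df_tiles, r_avg, g_avg, b_avg):
  # Same distance as `findBestTile`, but keep the tile that is farthest away:
  df_tiles['distance'] = ((df_tiles['r'] - r_avg)**2 + (df_tiles['g'] - g_avg)**2 + (df_tiles['b'] - b_avg)**2) ** 0.5
  farthest = df_tiles.nlargest(1, "distance")
  # To use this, call `findWorstTile` in place of `findBestTile` in the Section 9 loop.
  return farthest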
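
To see how the "worst" choice differs from the "best" one, here is a small sketch using the three example tiles from Section 8.  The helper `rank_tiles` is a hypothetical name for this example; unlike the functions above, it returns a copy with the `distance` column instead of changing the DataFrame passed in:

```python
import pandas as pd

# The three example tiles from Section 8.
df_tiles = pd.DataFrame({
    "fileName": ["red.jpg", "green.jpg", "blue.jpg"],
    "r": [255, 0, 0],
    "g": [0, 255, 0],
    "b": [0, 0, 255],
})

def rank_tiles(df_tiles, r_avg, g_avg, b_avg):
    """Return a copy of `df_tiles` with an added `distance` column."""
    distance = ((df_tiles.r - r_avg) ** 2
                + (df_tiles.g - g_avg) ** 2
                + (df_tiles.b - b_avg) ** 2) ** 0.5
    return df_tiles.assign(distance=distance)

ranked = rank_tiles(df_tiles, 10, 200, 20)
best = ranked.nsmallest(1, "distance")   # closest tile
worst = ranked.nlargest(1, "distance")   # farthest tile

print(best.fileName.values[0], worst.fileName.values[0])
```

For the average color (10, 200, 20), the best tile is `green.jpg` and the worst tile is `red.jpg`.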


<hr style="color: #DD3403;">

# Section 12: Showcase and Submission

We would **love** for you to share your mosaic!  We have created the discord channel `#project-showcase` just to show off your mosaic -- we hope you will add your `mosaic-web.jpg` there and check out others! :)

## Submission

You're almost done!  All you need to do is to commit your lab to GitHub:

1.  ⚠️ **Make certain to save your work.** ⚠️ To do this, go to **File => Save All**

2.  After you have saved, exit this notebook and follow the Canvas instructions to commit this lab to your Git repository!

3. Your TA will grade your submission and provide you feedback after the project is due. :)