# COGS 188 - Project Proposal

# Project Description

You have the choice of doing either (1) an "AI solves a problem" style project or (2) running a Special Topics class on a topic of your choice.  If you want to do (2) you should fill out the _other_ proposal for that. This is the proposal description for (1).

You will design and execute a machine learning project. There are a few constraints on the nature of the allowed project. 
- The problem addressed will not be a "toy problem" or "common training students problem" like 8-Queens or a small Traveling Salesman Problem or similar
- If it's the kind of problem (e.g., RL) that interacts with a simulator or live task, then the problem will have a reasonably complex action space. For instance, a wumpus world kind of thing with a 9x9 grid is definitely too small.  A simulated mountain car with a less complex 2-d road and simplified dynamics seems like a fairly low achievement level.  A more complex 3-d mountain car simulation with large extent and realistic dynamics, sure sounds great!
- If it's the kind of problem that uses a dataset, then the dataset will have >1k observations and >5 variables. I'd prefer more like >10k observations and >10 variables. A general rule is that if you have >100x more observations than variables, your solution will likely generalize a lot better. The goal of training a supervised machine learning model is to learn the underlying pattern in a dataset in order to generalize well to unseen data, so choosing a large dataset is very important.
- The project must include some elements we talked about in the course
- The project will include a model selection and/or feature selection component where you will be looking for the best setup to maximize the performance of your AI system. Generally RL tasks may require a huge amount of training, so extensive grid search is unlikely to be possible. However, exploring a few reasonable hyper-parameters may still be possible. 
- You will evaluate the performance of your AI system using more than one appropriate metric
- You will be writing a report describing and discussing these accomplishments


Feel free to delete this description section when you hand in your proposal.

# Names

Hopefully your team is at least this good. Obviously you should replace these with your names.

- Katy Stadler
- Elvin Li
- Jiasheng Zhou
- Harry Wang

# Abstract 
This section should be short and clearly stated. It should be a single paragraph <200 words.  It should summarize: 
- what your goal/problem is
- what the data used represents and how they are measured
- what you will be doing with the data
- how performance/success will be measured

Our goal is to create a sketch-to-image tool that takes a user-provided sketch and outputs an image based on that sketch in a \<certain style\>. We will use several datasets to test our model's capability. One is a large dataset (13.9K observations) of stills taken from Avatar the Last Airbender, an animated show with a recognizable style. Another is SketchyDatabase, a dataset containing sketch-and-realistic-image pairs. The data will be converted into numpy arrays for easier manipulation, with RGB values normalized to the range [0,1] and images resized to a fixed dimension. The data will also be partitioned into training, validation, and test sets. We will use edge detection to create "sketch" versions of the images in the Avatar dataset, whereas SketchyDatabase already provides sketch-image pairs. We will experiment with a variety of deep learning and diffusion models in an attempt to reconstruct the styled image from each sketch, and use standard computer vision metrics such as Structural Similarity Index (SSIM) and Fréchet Inception Distance (FID) to evaluate how well our model's reconstructed images compare to the ground truth.

# Background

<!--
Fill in the background and discuss the kind of prior work that has gone on in this research area here. **Use inline citation** to specify which references support which statements.  You can do that through HTML footnotes (demonstrated here). I used to recommend Markdown footnotes (google is your friend) because they are simpler but recently I have had some problems with them working for me whereas HTML ones always work so far. So use the method that works for you, but do use inline citations.

Here is an example of inline citation. After government genocide in the 20th century, real birds were replaced with surveillance drones designed to look just like birds<a name="lorenz"></a>[<sup>[1]</sup>](#lorenznote). Use a minimum of 3 to 5 citations, but we prefer more <a name="admonish"></a>[<sup>[2]</sup>](#admonishnote). You need enough citations to fully explain and back up important facts. 

Remember you are trying to explain why someone would want to answer your question or why your hypothesis is in the form that you've stated. 
-->

To train the complex generative models in use today, features must first be extracted from images. One way to do this is with edge detection, which is used to identify the boundaries of objects within images <a name="edgedetectnote"></a>[<sup>[1]</sup>](#edgedetectnote). The Sobel Operator uses two 3x3 convolution kernels to calculate the gradient in the vertical and horizontal directions, finding the direction of the largest change in intensity and the rate of change in that direction <a name="isotropicnote"></a>[<sup>[2]</sup>](#isotropicnote). Later on came the Canny edge detector, which builds on the Sobel Operator and is a "simple approximate implementation in which edges are marked at maxima in gradient magnitude of a Gaussian-smoothed image" <a name="cannynote"></a>[<sup>[3]</sup>](#cannynote). Previous work like this will help us with feature detection and also help us create our dataset of sketch-image pairs.
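As a rough illustration of the Sobel step described above, here is a minimal NumPy sketch. The helper names (`filter3x3`, `sobel_edges`) and the toy 6x6 input are assumptions made for this example, not part of our pipeline; the filter is applied as a cross-correlation over the valid region only, and a library implementation would likely be used in practice.

```python
import numpy as np

# Standard 3x3 Sobel kernels for the horizontal (x) and vertical (y) derivatives.
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T


def filter3x3(image, kernel):
    """Slide a 3x3 kernel over a 2-D image (cross-correlation, valid region only)."""
    h, w = image.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * image[i:i + h - 2, j:j + w - 2]
    return out


def sobel_edges(gray):
    """Return gradient magnitude and direction (radians) of a grayscale image."""
    gx = filter3x3(gray, SOBEL_X)
    gy = filter3x3(gray, SOBEL_Y)
    magnitude = np.hypot(gx, gy)      # rate of change at each pixel
    direction = np.arctan2(gy, gx)    # direction of increasing intensity
    return magnitude, direction


# Toy input (hypothetical): dark left half, bright right half, one vertical edge.
toy = np.zeros((6, 6))
toy[:, 3:] = 1.0
magnitude, direction = sobel_edges(toy)
```

Thresholding `magnitude` would give a crude binary "sketch"; Canny adds Gaussian smoothing, non-maximum suppression, and hysteresis thresholding on top of this gradient.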

There are several sketch-to-image models based on deep learning. Pix2Pix's goal is to transform one type of image into another (for example, converting a simple sketch into a realistic image) <a name="isolanote"></a>[<sup>[4]</sup>](#isolanote). It uses a generator, which takes an input image and generates an output image, and a discriminator, which evaluates whether a given output comes from the training dataset or was generated by the model. The two components are trained together in a minimax framework: the generator tries to minimize the difference between its generated output and the real image, while the discriminator maximizes its ability to correctly classify images <a name="goodfellownote"></a>[<sup>[5]</sup>](#goodfellownote). Many people have trained Pix2Pix on different datasets to produce different results, such as the edges2...cats, shoes, handbags, and more <a name="demonote"></a>[<sup>[6]</sup>](#demonote) project, which produces (sometimes worrying) cat-like pictures from a basic sketch, or the Doodles to Pictures! <a name="pixnote"></a>[<sup>[7]</sup>](#pixnote) project, which translates sketches using different models for cats, birds, lollipops, snakes, and more.
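For reference, the minimax framework mentioned above is usually written as follows, using the notation of the cited GAN and Pix2Pix papers. Here $G$ is the generator, $D$ the discriminator, $x$ a real sample, and $z$ a noise vector:

$$
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]
$$

In the conditional setting used by Pix2Pix, with $x$ now denoting the input sketch and $y$ the target image, the adversarial term is combined with an L1 reconstruction term weighted by a hyperparameter $\lambda$:

$$
G^* = \arg\min_G \max_D \; \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))] + \lambda \, \mathbb{E}_{x,y,z}[\lVert y - G(x, z) \rVert_1]
$$

The L1 term is the "difference between generated output and real image" part of the objective, while the first two terms capture the discriminator's classification task.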

Gunjate et al. (2023) <a name="gunjatenote"></a>[<sup>[8]</sup>](#gunjatenote) provide a generative adversarial network method for creating convincing images. The process starts with a random sketch image, which lacks detail and is usually grayscale. The generator network then transforms the sketch into a data instance, and the discriminator classifies the generated data. The discriminator loss penalizes the discriminator for misclassifying the data, and its weights are then updated through backpropagation. The generator loss penalizes the generator for failing to "trick" the discriminator. 
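The two losses described above can be sketched with binary cross-entropy, as below. This is an illustrative assumption rather than the exact formulation in the cited paper: the function names are made up for this example, the generator loss uses the common non-saturating form, and the discriminator scores are hypothetical numbers rather than outputs of a trained network.

```python
import numpy as np


def binary_cross_entropy(pred, target, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and 0/1 targets."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))


def discriminator_loss(d_real, d_fake):
    """Penalize the discriminator for calling real images fake or fake images real."""
    real_term = binary_cross_entropy(d_real, np.ones_like(d_real))
    fake_term = binary_cross_entropy(d_fake, np.zeros_like(d_fake))
    return real_term + fake_term


def generator_loss(d_fake):
    """Penalize the generator when the discriminator is not fooled (non-saturating form)."""
    return binary_cross_entropy(d_fake, np.ones_like(d_fake))


# Hypothetical discriminator outputs: probability that an image is real.
d_real = np.array([0.9, 0.8])   # scores on real images
d_fake = np.array([0.1, 0.2])   # scores on generated images

d_loss = discriminator_loss(d_real, d_fake)
g_loss = generator_loss(d_fake)
```

With these scores the discriminator is doing well, so its loss is small while the generator's loss is large; during training each network's weights would be updated by backpropagating its own loss.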

For our project, we want to build on these advancements and work towards two goals: first, we want to learn how these tools work down to the implementation level, using what we have learned in this class to implement some of the model ourselves and refine the model. Second, we want to apply a novel style for our sketches to be translated into, just as previous applications have turned sketches into cat-like images. 

# Problem Statement

Clearly describe the problem that you are solving. Avoid ambiguous words. The problem described should be well defined and should have at least one ML-relevant potential solution. Additionally, describe the problem thoroughly such that it is clear that the problem is quantifiable (the problem can be expressed in mathematical or logical terms), measurable (the problem can be measured by some metric and clearly observed), and replicable (the problem can be reproduced and occurs more than once).  

The goal of this project is to develop a deep learning model that translates hand-drawn sketches into images of a specific style, bridging the gap between abstract line drawings and visual representations. Sketches serve as simplified depictions of objects, often omitting texture, color, and intricate details. The challenge lies in transforming these sketches into images that not only preserve structural integrity but also adhere to a chosen style, whether it be photorealistic, in the style of Avatar, or any other domain.

This problem can be framed as a supervised learning task in which the model learns a mapping function from the sketch domain to a target image domain using paired sketch-image examples. The model's effectiveness can be evaluated using various metrics. Structural Similarity Index (SSIM) measures the preservation of structural details, while Fréchet Inception Distance (FID) evaluates how closely generated images match the distribution of the target style. Learned Perceptual Image Patch Similarity (LPIPS) provides a perceptual assessment of similarity, ensuring that generated images align with human visual expectations. Additionally, retrieval-based metrics such as Recall@K can assess how accurately the model generates images that fit the intended domain. The problem is also replicable, as the datasets we chose, such as the Sketchy Database and style-specific datasets, provide a consistent framework for training and evaluation.

Several machine learning approaches developed in recent years can be used to address this problem. Conditional Generative Adversarial Networks (cGANs), such as Pix2Pix, can generate images conditioned on input sketches while ensuring they adhere to a specific style. Diffusion models, such as Stable Diffusion with ControlNet, offer fine-grained control over image synthesis by guiding the diffusion process based on sketch constraints. Autoencoder-based models, such as Variational Autoencoders (VAEs) or U-Nets, can learn meaningful representations of sketches and transform them into stylized outputs. Style transfer techniques can also be integrated to refine the generated images to better match a particular aesthetic. Each of these methods provides a unique approach to the problem, depending on the desired balance between fidelity to the sketch and adherence to the target style.

This project aims to enhance sketch-to-style image translation by leveraging deep learning techniques to generate high-quality, stylized images from sketches. By investigating different modeling approaches and evaluating performance using rigorous metrics, we hope to develop an effective sketch-to-image translation model.

# Data

You should have a strong idea of what dataset(s) will be used to accomplish this project. 

If you know what (some) of the data you will use, please give the following information for each dataset:
- link/reference to obtain it
- description of the size of the dataset (# of variables, # of observations)
- what an observation consists of
- what some critical variables are, how they are represented
- any special handling, transformations, cleaning, etc will be needed

If you don't yet know what your dataset(s) will be, you should describe what you desire in terms of the above bullets.

Dataset: https://huggingface.co/datasets/lumenggan/avatar-the-last-airbender-tagged
- 13.9K rows. Each row is a still image from Avatar the Last Airbender (an animated show) tagged with descriptions of the scene. Key variables will be the features that describe an image, and they will have to be extracted from the images in the dataset. The most important will be the pixel values of the image, either in color (RGB) or grayscale. We will also need to apply edge detection to each image to generate sketch representations (for training data). In terms of special handling, transformation, and cleaning, we will filter out low-quality images, convert images to RGB, and normalize pixel values. We will also have to standardize the size of the images and make sure that each edge-detected sketch stays matched with its source image.  

Dataset: https://github.com/CDOTAD/SketchyDatabase?tab=readme-ov-file  
- 12.5k images. Each original photographic image is paired with several sketches of it drawn by humans, totalling 75,471 sketches. The photos are miscellaneous, ranging from animals such as kangaroos, elephants, and geese to everyday objects like teapots or alarm clocks. Since the sketch-image pairs are already present, we will process the dataset by standardizing RGB values and image sizes. 
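The preprocessing described for both datasets (normalizing RGB values to [0, 1], standardizing image size, and partitioning the data) could look roughly like the sketch below. The helper names, the 32x32 target size, the 80/10/10 split, and the random stand-in images are all assumptions for illustration; the real pipeline would load images from the datasets above and would likely use a library resizer with proper interpolation.

```python
import numpy as np


def normalize_rgb(images_uint8):
    """Scale 8-bit RGB values from [0, 255] to floats in [0, 1]."""
    return images_uint8.astype(np.float32) / 255.0


def resize_nearest(image, height, width):
    """Resize an (H, W, C) image to (height, width, C) with nearest-neighbor sampling."""
    rows = np.arange(height) * image.shape[0] // height
    cols = np.arange(width) * image.shape[1] // width
    return image[rows][:, cols]


def split_indices(n, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle indices 0..n-1 and cut them into train/validation/test parts."""
    order = np.random.default_rng(seed).permutation(n)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return order[:n_train], order[n_train:n_train + n_val], order[n_train + n_val:]


# Stand-in data: 10 random 48x64 RGB arrays in place of images loaded from disk.
rng = np.random.default_rng(0)
raw = [rng.integers(0, 256, size=(48, 64, 3), dtype=np.uint8) for _ in range(10)]

# Resize each image to a fixed size, then normalize and stack into one array.
batch = np.stack([normalize_rgb(resize_nearest(img, 32, 32)) for img in raw])
train_idx, val_idx, test_idx = split_indices(len(batch))
```

Splitting by index rather than by copying arrays makes it easy to keep each sketch and its image in the same partition, since both can be looked up with the same index.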

Several other datasets with already-matched sketch-image pairs are also available; training on any of them would let our model produce a different output style.

# Proposed Solution (Help)

In this section, clearly describe a solution to the problem. The solution should be applicable to the project domain and appropriate for the dataset(s) or input(s) given. Provide enough detail (e.g., algorithmic description and/or theoretical properties) to convince us that your solution is applicable. Why might your solution work? Make sure to describe how the solution will be tested.  

If you know details already, describe how (e.g., library used, function calls) you plan to implement the solution in a way that is reproducible.

If it is appropriate to the problem statement, describe a benchmark model<a name="sota"></a>[<sup>[3]</sup>](#sotanote) against which your solution will be compared. 

# Evaluation Metrics

Propose at least one evaluation metric that can be used to quantify the performance of both the benchmark model and the solution model. The evaluation metric(s) you propose should be appropriate given the context of the data, the problem statement, and the intended solution. Describe how the evaluation metric(s) are derived and provide an example of their mathematical representations (if applicable). Complex evaluation metrics should be clearly defined and quantifiable (can be expressed in mathematical or logical terms).  


To evaluate the performance of both the benchmark and solution models for sketch-to-style translation, Fréchet Inception Distance (FID) and Structural Similarity Index (SSIM) can be used. FID measures how similar the distribution of generated images is to real images by computing the Fréchet distance between their feature representations in a pretrained Inception network. It is defined as:

$$
FID = ||\mu_r - \mu_g||^2 + \text{Tr}(\Sigma_r + \Sigma_g - 2\sqrt{\Sigma_r \Sigma_g})
$$

where $\mu_r, \Sigma_r$ are the mean and covariance of the Inception feature vectors of real images, and $\mu_g, \Sigma_g$ are those of generated images. Lower FID indicates that the generated images better match the target style. SSIM, on the other hand, compares the luminance, contrast, and structure of two images (typically over local windows) and is given by:

$$
SSIM(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}
$$

where $\mu_x, \mu_y$ are mean intensities, $\sigma_x^2, \sigma_y^2$ are variances, $\sigma_{xy}$ is the covariance, and $C_1, C_2$ are small constants that stabilize the division when the denominator is close to zero. Higher SSIM values indicate greater structural similarity between the generated and reference images. FID captures perceptual quality and style consistency, while SSIM measures structural integrity, making them complementary for evaluating how well the model preserves sketch details while achieving the desired style transformation.
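Both formulas can be checked numerically with a short NumPy sketch. Several simplifications here are assumptions for illustration: `ssim_global` evaluates the SSIM formula once over the whole image instead of averaging over local windows, the constants follow the common convention $C_1 = (0.01L)^2$, $C_2 = (0.03L)^2$ for data range $L$, and the "features" passed to `fid_from_features` are random stand-ins rather than activations from a pretrained Inception network.

```python
import numpy as np


def ssim_global(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """SSIM formula evaluated once over whole images (no sliding window)."""
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)


def fid_from_features(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two (n_samples, n_features) arrays."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    # Tr(sqrt(Sigma_r Sigma_g)) equals the sum of square roots of the product's
    # eigenvalues, which are real and non-negative up to round-off.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_r - mu_g) ** 2)
    return float(mean_term + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * trace_sqrt)


# Stand-in data (hypothetical): a random image, a noisy copy, and two feature sets.
rng = np.random.default_rng(0)
image = rng.random((32, 32))
noisy = np.clip(image + 0.1 * rng.standard_normal((32, 32)), 0.0, 1.0)
feats_a = rng.standard_normal((500, 4))
feats_b = rng.standard_normal((500, 4)) + 2.0   # shifted distribution

ssim_same = ssim_global(image, image)
ssim_noisy = ssim_global(image, noisy)
fid_same = fid_from_features(feats_a, feats_a)
fid_shifted = fid_from_features(feats_a, feats_b)
```

An image compared with itself gives an SSIM of 1 and identical feature sets give an FID of about 0, while the noisy copy and the shifted features score worse on their respective metrics. For the actual evaluation, a windowed SSIM such as scikit-image's `structural_similarity` and real Inception features would be more appropriate.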


# Ethics & Privacy

If your project has obvious potential concerns with ethics or data privacy discuss that here.  Almost every ML project put into production can have ethical implications if you use your imagination. Use your imagination. Get creative!

Even if you can't come up with an obvious ethical concern that should be addressed, you should know that a large number of ML projects that go into production have unintended consequences and ethical problems once in production. How will your team address these issues?

Consider a tool to help you address the potential issues such as https://deon.drivendata.org

# Team Expectations 

* *Communications through group chat via Instagram and email. Expected to be addressed within 24 hours.*
* *We will create a schedule on sheets to keep track of course deadlines and self-set deadlines, with assignments for a specific member if applicable.*
* *It is understood that members may have different strengths/expertise, and it is expected that everyone contributes substantially to the project so that the work is evenly divided.*
* *Team or scheduling issues will be addressed in a group meeting, either in person or via Zoom. We prefer for in-depth issues to be discussed face-to-face.*
* *Week-to-week goals will be set and there will be weekly checkins to update the spreadsheet and check in on progress.*

# Project Timeline Proposal

Replace this with something meaningful that is appropriate for your needs. It doesn't have to be something that fits this format.  It doesn't have to be set in stone... "no battle plan survives contact with the enemy". But you need a battle plan nonetheless, and you need to keep it updated so you understand what you are trying to accomplish, who's responsible for what, and what the expected due dates are for each item.

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/12  |  6:30 PM |  Brainstorm topics/questions (all)  | Determine best form of communication, decide on project topic | 
| 2/14  |  11:59 PM |  Research datasets, do background research, define the problem | Complete and turn in project proposal | 
| 2/19  | 6:30 PM  | Finalize datasets, preprocessing, start to outline implementation  | Discuss implementation outline, assign parts among the group   |
| 2/26  | 6:30 PM  | Data preprocessing, feature extraction | Make sure work is combined seamlessly. Discuss evaluation metrics  |
| 3/05  | 6:30 PM  | Initial model implementation | Fine-tune hyperparameters and improve model performance |
| 3/12  | 6:30 PM  | Create material for the report, continue working on fine tuning the model | Discuss what we want the report to look like and divide work |
| 3/15  | 6:30 PM  | Complete individually assigned work | Go over the report and refine! |
| 3/19  | 6:30 PM  | NA | One last meeting before we turn in the project to make sure we are happy with everything |

# Footnotes
<!--
<a name="lorenznote"></a>1.[^](#lorenz): Lorenz, T. (9 Dec 2021) Birds Aren’t Real, or Are They? Inside a Gen Z Conspiracy Theory. *The New York Times*. https://www.nytimes.com/2021/12/09/technology/birds-arent-real-gen-z-misinformation.html<br> 
<a name="admonishnote"></a>2.[^](#admonish): Also refs should be important to the background, not some randomly chosen vaguely related stuff. Include a web link if possible in refs as above.<br>
<a name="sotanote"></a>3.[^](#sota): Perhaps the current state of the art solution such as you see on [Papers with code](https://paperswithcode.com/sota). Or maybe not SOTA, but rather a standard textbook/Kaggle solution to this kind of problem
-->


<a name="edgedetectnote"></a>1.[^](#edgedetectnote): GeeksforGeeks. (2024, July 22). What is Edge Detection in Image Processing? GeeksforGeeks. https://www.geeksforgeeks.org/what-is-edge-detection-in-image-processing/  
<a name="isotropicnote"></a>2.[^](#isotropicnote): Sobel, Irwin and G. M. Feldman. “An Isotropic 3×3 image gradient operator.” (1990).  
<a name="cannynote"></a>3.[^](#cannynote): Canny, J. (1986, November). A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-8(6), 679-698. https://ieeexplore.ieee.org/document/4767851  
<a name="isolanote"></a>4.[^](#isolanote): Isola, P., Zhu, J., Zhou, T., & Efros, A. A. (2016, November 21). Image-to-Image Translation with Conditional Adversarial Networks. arXiv.org. https://arxiv.org/abs/1611.07004  
<a name="goodfellownote"></a>5.[^](#goodfellownote): Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014, June 10). Generative adversarial networks. arXiv.org. https://arxiv.org/abs/1406.2661  
<a name="demonote"></a>6.[^](#demonote): Image-to-Image Demo - Affine layer. (n.d.). https://affinelayer.com/pixsrv/  
<a name="pixnote"></a>7.[^](#pixnote): Pix2Pix Image Transfer Activity. (n.d.). https://mitmedialab.github.io/GAN-play/  
<a name="gunjatenote"></a>8.[^](#gunjatenote): Gunjate, S., Nakhate, T., Kshirsagar, T., & Sapat, Y. (2023, January). Sketch to Image using GAN. International Journal of Innovative Science and Research Technology (IJISRT), 8(1), 772-777. ISSN 2456-2165. https://doi.org/10.5281/zenodo.7588232