# Module 1 Project

## Project overview

In this project, you will choose a dataset and problem based on your specialization or interests in Electrical and Computer Engineering.  You will decide how you want to solve the problem using the given dataset, then create the code and writeup necessary for others to understand how you solved the problem.  You then have the option of using [Canvas Studio](https://community.canvaslms.com/t5/Studio/How-do-I-record-a-Canvas-Studio-screen-capture-video-in-a-course/ta-p/1713) to record a video presentation of the problem and your solution, walking through your thought process and code.  If you choose not to make a recording, then you must write your thought process and conclusions in detail within Markdown cells.

## Project rubric

The project is graded out of 10 points. Points are awarded as follows:

- 2 points possible for a correct statistical interpretation of the problem
    - In other words, make sure you state at the top the type of statistical analysis 
      you will use to solve the problem and why you chose to perform this type of analysis
      given the dataset and the problem description.
- 2 points possible for implementing a statistical analysis flow that is consistent with your interpretation
    - In other words, make sure that the analytical flow you implement throughout your notebook and 
      video is consistent with the type of statistical analysis you chose to solve the problem.
- 2 points possible for clear Python code logic and intention
    - In other words, make sure that you utilize Markdown and code cells to explain the steps you are
      taking with code that is written.  For example, if you have a code cell that adds a new column
      to a DataFrame (such as when using dummy variables), make sure you explain why you are adding
      this new column.
- 2 points possible for clear, readable Python code organization
    - In other words, if you have any code that needs to be repeated, try to create functions rather
      than copy/pasting the same code multiple times. Your code does not need to be perfect, but it 
      should be readable and easily followed. If you create a function, make sure to include the
      function header so that others can understand the purpose and parameters/returns of your function.
- 2 points possible for the audio-video presentation or detailed Markdown text explanations
    - In other words, if you do not want to record a video explanation, then you must explain your
      statistical analysis and the corresponding Python code thoroughly in Markdown cells.
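
As an illustration of the two code-related rubric items above, here is a minimal sketch of a reusable function with a documented header. The function name `add_dummy_columns`, the toy column names, and the toy values are all hypothetical and are not taken from any of the project datasets; the NumPy-style docstring is one common convention, not a required format.

```python
import pandas as pd


def add_dummy_columns(df, column, drop_first=True):
    """Replace a categorical column with 0/1 dummy (indicator) columns.

    Parameters
    ----------
    df : pandas.DataFrame
        Input data. It is not modified in place.
    column : str
        Name of the categorical column to encode.
    drop_first : bool, optional
        If True, drop the first category so the remaining dummy columns
        are not perfectly collinear with a model intercept.

    Returns
    -------
    pandas.DataFrame
        A copy of `df` with `column` replaced by its dummy columns.
    """
    return pd.get_dummies(df, columns=[column], drop_first=drop_first, dtype=int)


# Hypothetical toy data, used only to demonstrate the function.
toy = pd.DataFrame(
    {"scheme": ["A", "B", "A", "C"], "runtime_ms": [1.2, 0.9, 1.1, 1.5]}
)
encoded = add_dummy_columns(toy, "scheme")
print(encoded)
```

In a notebook, a Markdown cell placed before the call would state why the dummy columns are needed, for example that the model requires numeric inputs.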
      
If any academic dishonesty is found (for example, unexplained similarities between your submission and other submitted work or work posted on the Internet), you will receive zero points on this assignment, fail the course, and be referred to the academic dishonesty office, which may be grounds for dismissal from the MSEE program at Cal State LA.

In other words, do your own work, justify every analysis you run either in writing or in the video, and do not work with others on this assignment.

## Select a problem based on your specialization in ECE:

Scroll down until you see one of your specializations (sorted alphabetically) and choose a problem.

Then, delete the text of all of the other problems, leaving just the problem you are working to solve below.


**Biomedical engineering**: Body fat measurement mechanism design

_Dataset:_ `project_datasets/bme_bodyfat.csv`

_Dataset description:_ This is a dataset of body measurements taken from human subjects. Columns:
1. Density (target, density of subject in grams per cubic centimeter)
2. BodyFat (target, body fat percentage based on density)
3. Age (in years)
4. Weight (in pounds)
5. Height (in inches)
6. Neck (circumference in cm)
7. Chest (circumference in cm)
8. Abdomen (circumference in cm)
9. Hip (circumference in cm)
10. Thigh (circumference in cm)
11. Knee (circumference in cm)
12. Ankle (circumference in cm)
13. Biceps (circumference in cm)
14. Forearm (circumference in cm)
15. Wrist (circumference in cm)

_Problem:_ Is it possible to estimate body fat percentage using the given data without using density?  If so, which features matter?  Do the features that matter make sense? Why or why not?


**Communications/electronics engineering**: Antenna design

_Dataset:_ `project_datasets/comm_antenna.csv`

_Dataset description:_ This is a dataset of different antenna designs. Columns:
1. TestFreq (frequency used for testing the signal strength)
2. PatchLength (length of patch antenna in mm)
3. PatchWidth (width of patch antenna in mm)
4. SlotLength (length of slot in antenna in mm)
5. SlotWidth (width of slot in antenna in mm)
6. Strength (signal strength in dB, higher is better)

_Problem:_ Is it possible to create a statistical model that can estimate signal strength based on these parameters? Additionally, is it possible to create a model that uses only the parameters other than the test frequency?  What are the best accuracies of your statistical models?
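
For a continuous target such as signal strength, "accuracy" can be read in more than one way. The sketch below assumes one possible reading: RMSE and R^2 computed on held-out rows. It uses synthetic stand-in data and a plain least-squares fit purely for illustration; the coefficients, noise level, and 75/25 split are arbitrary assumptions, not properties of the antenna dataset or a recommended model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the antenna dataset): two features, one target.
X = rng.uniform(0, 10, size=(200, 2))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + 5.0 + rng.normal(0, 0.1, size=200)

# Hold out the last 25% of rows so the scores reflect unseen data.
split = 150
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Ordinary least squares with an explicit intercept column.
A_train = np.column_stack([np.ones(len(X_train)), X_train])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)

# Predict on the held-out rows using the fitted coefficients.
A_test = np.column_stack([np.ones(len(X_test)), X_test])
y_pred = A_test @ coef

# Root-mean-square error and coefficient of determination on held-out rows.
rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))
r2 = 1.0 - np.sum((y_test - y_pred) ** 2) / np.sum((y_test - y_test.mean()) ** 2)
print(f"held-out RMSE = {rmse:.3f}, R^2 = {r2:.4f}")
```

Whichever metric you report, state in a Markdown cell which one you chose and why it suits the problem.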


**Computer engineering**: GPU workload design

_Dataset:_ `project_datasets/computer_gpu.csv`

_Dataset description:_ This is a dataset from running an OpenCL benchmark on an AMD GPU. This benchmark breaks up matrix math by resizing a very large matrix (or large number of matrices) into matrices that can fit in hardware memory and cache by using a third dimension. Columns:
1. Workgrp_m (Workgroup size (number of compute units) used for first dimension of matrices)
2. Workgrp_n (Workgroup size (number of compute units) used for second dimension of matrices)
3. Workgrp_k (Workgroup size (number of compute units) used for third dimension of matrices)
4. Local_m (Local workgroup size (number of kernels running on one compute unit) used for first dimension of matrices)
5. Local_n (Local workgroup size (number of kernels running on one compute unit) used for second dimension of matrices)
6. Mem_m (Local memory length used for first dimension of matrices)
7. Mem_n (Local memory length used for second dimension of matrices)
8. Kernel_unroll (Number of times loops are unrolled)
9. VectorWidth_m (Width of vector instruction used for first dimension of matrices)
10. VectorWidth_n (Width of vector instruction used for second dimension of matrices)
11. Stride_m (Use of off-chip memory for the first dimension of matrices)
12. Stride_n (Use of off-chip memory for the second dimension of matrices)
13. Cache_A (Use of caching scheme A)
14. Cache_B (Use of caching scheme B)
15. Runtime (target, runtime in ms)


_Problem:_ For this GPU, which features of the OpenCL benchmark seem to affect runtime the most? Which statistical modeling technique works best to predict the runtime based on this data? Which features are the most important for your model?


**Power engineering**: Solar power generator

_Dataset:_ `project_datasets/power_solarplant.csv`

_Dataset description:_ This is a dataset from a solar power plant. Columns:
1. DateTime (Date and time in MM/DD/YYYY HH:MM format)
2. AmbientTemp (Plant-wide ambient temperature in degC)
3. ModuleTemp (Mean panel temperature in degC)
4. Irradiation (irradiance at measurement time in W/m^2)
5. PowerDC (DC power generated by this plant in kW)
6. PowerAC (AC power generated by this plant in kW)

_Problem:_ Can ambient temperature, module temperature, and irradiation provide a reasonable estimate for the DC power generated by this solar power plant?  Does your answer to the previous question make sense?  Why or why not?  What is the maximum accuracy of your statistical model?
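
The `DateTime` column is described above as text in MM/DD/YYYY HH:MM format. If you decide to work with it as a timestamp, one possible approach is sketched below. The two sample values are hypothetical and are not taken from `power_solarplant.csv`, and the derived `Hour` column is only an example of what a parsed timestamp makes available.

```python
import pandas as pd

# Hypothetical sample values in the documented MM/DD/YYYY HH:MM format.
sample = pd.DataFrame({"DateTime": ["05/15/2020 00:15", "05/15/2020 13:30"]})

# Parse the text with an explicit format so month and day are not swapped.
sample["DateTime"] = pd.to_datetime(sample["DateTime"], format="%m/%d/%Y %H:%M")

# Example of a derived column: hour of day as an integer.
sample["Hour"] = sample["DateTime"].dt.hour
print(sample.dtypes)
```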