# Project 1 Tips and Questions
### IMPORTANT PLEASE READ THIS
First and foremost, familiarise yourself with the TLC Trip Record Data homepage: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page

Read through the relevant data dictionaries:
- **MUST READ:** https://www1.nyc.gov/assets/tlc/downloads/pdf/trip_record_user_guide.pdf
- https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf
- https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_green.pdf
- https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_fhv.pdf
- https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_hvfhs.pdf

Why? Your tutors can be treated as "experts" in this field. To prepare you for the Industry Project, we need to assess how well you adhere to requirements and business rules.

The tutor team knows this dataset inside out. If you filter records incorrectly or without sufficient justification, you will lose marks as per the requirements.

### An Incorrect Example
- Scenario: A student analyses `tip_amount`, finds several `NULL` values, and either drops them or includes them in the analysis. Later on, they use a regression model to predict this value.

- Result: According to the data dictionary, `tip_amount` is automatically populated for credit card tips (`payment_type` is `1`). Cash tips are not included. This means the student's analysis included all payment types even though the data dictionary clearly specifies this rule.

- Penalty: The student will lose marks on the analysis section. The modelling section will be marked _assuming_ they got this filtering method correct. However, if this mistake causes another issue later on, a further penalty will be applied. Please get this right!

- Solution: The student should filter for only `payment_type=1`. They can then (hopefully) conduct a correct analysis of `tip_amount`.
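
A minimal sketch of the filtering step from the solution above, written in plain Python so that it does not assume any particular dataframe library. The function name `filter_credit_card_trips`, the constant `CREDIT_CARD` and the sample rows are hypothetical and for illustration only; in your project you would apply the same condition with whatever tool you use to load the data.

```python
# payment_type code for credit card, as stated in the data dictionary
CREDIT_CARD = 1


def filter_credit_card_trips(trips):
    """
    Return only the trips paid by credit card, since tip_amount
    is only populated for this payment type.
    """
    return [trip for trip in trips if trip["payment_type"] == CREDIT_CARD]


# Hypothetical sample rows for illustration (not real TLC records)
sample_trips = [
    {"payment_type": 1, "tip_amount": 2.50},
    {"payment_type": 2, "tip_amount": 0.00},  # cash: tip is not recorded
    {"payment_type": 1, "tip_amount": 0.00},  # credit card with no tip
]

credit_card_trips = filter_credit_card_trips(sample_trips)
```

In pandas, for instance, the equivalent condition would be `df[df["payment_type"] == 1]`. Whichever tool you use, state the business rule behind the filter in a markdown cell or comment so the justification is visible to the marker.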

Several students over the past few years have lost many marks for simple rules like this (especially `tip_amount`).

### Readable Code
- We will be assessing the quality of your code and how you present it in your notebooks. 
- This is because there is no point writing code that cannot be easily interpreted. At the end of the day, employers and clients are not only paying for your analysis, but also the corresponding code. 
- If your code is confusing or difficult to read, there is little chance your client will come back to you.

**Variable Names:**  
Any naming convention is fine as long as you are consistent. For example, commit to using either:
- Snake Case: words are separated by underscores, such as `variable_name`
- Camel Case: words are separated by capitals, such as `variableName`

Your variable names should be contextual and describe what they hold. That is, try to name your variables so that they are understandable **without comments**.
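
As a rough illustration of the difference, the sketch below performs the same calculation twice, first with uninformative names and then with contextual snake case names. The values and the name `tip_fraction` are made up for this example.

```python
# Hard to follow without comments: what do x, y and z represent?
x = 12.5
y = 2.5
z = y / x

# The same calculation with contextual names (hypothetical values)
fare_amount = 12.5
tip_amount = 2.5
tip_fraction = tip_amount / fare_amount
```

The second version needs no comment to explain what is being computed, which is the goal.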

**Comments and Docstrings (w.r.t. Jupyter Notebook Cells):**  
Cells in Jupyter Notebook should aim to do one "block of logic" at a time (e.g. importing libraries, defining functions, filtering rows, etc.).
- If it takes a reader more than a few seconds to understand your cell, you need comments.
- Your functions need to have docstrings describing what they do. If you have forgotten how to write one, search online or revisit your COMP10001 Grok course.
- Use markdown cells for longer comments or explanations of logic, and inline comments in code for short descriptions of hard-to-understand code.

We won't ask you to run `flake8` or `pylint` on your notebooks. We just ask for good comments in the code and markdown cells, reasonable variable names, and clean directories.

Here is an example of a good docstring and comments for a function.

```python
def some_function(some_val: str) -> str:
    """
    This function takes in some string value
    and outputs it normalised for case-insensitive comparison.
    """
    # casefold() is a more aggressive lower(), so "ABC" and "abc" compare equal
    new_val = some_val.casefold()
    return new_val
```