## Summary

LLMs expose tunable decoding parameters that change how they pick the next token. We can experiment with them in [OpenAI's playground](https://platform.openai.com/playground/) or [together.ai's playground](https://docs.together.ai/docs/inference-web-interface).

- `temperature` is a hyperparameter that controls the randomness of the model’s output.
  - It controls the 'flatness' of the next-token probability distribution: lower values sharpen it around the likeliest tokens, higher values flatten it.

    - Low temperature (e.g. 0.0 → 0.3)
        - More deterministic, focused, repetitive
        - The model picks the most probable next token almost every time
        - Great for:
            - Factual tasks
            - Summaries
            - Code generation

    - Medium temperature (e.g. ~0.5 → 0.7)
        - Balanced creativity and accuracy
        - Useful for:
            - Creative writing with some factual grounding
            - Brainstorming with reasonable coherence

    - High temperature (e.g. 0.8 → 1.5 or more)
        - More randomness and variation
        - The model explores less likely tokens
        - Useful for:
            - Creative storytelling
            - Generating diverse ideas
            - Humor, poetry

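    At temperature = 0 (greedy decoding), the model always picks the single most probable token, so in principle the same prompt yields the same output every time. In practice, hosted APIs can still show small run-to-run differences.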
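
To make the 'flatness' idea concrete, here is a minimal sketch of temperature scaling: the logits are divided by the temperature before the softmax. The function name and the four logit values are made up for illustration; real models work over vocabularies of many thousands of tokens, and temperature 0 is handled as a plain argmax rather than a division.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        # Temperature 0 is treated as greedy decoding (argmax), not as a division.
        raise ValueError("temperature must be > 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a four-token vocabulary (illustrative values only).
logits = [2.0, 1.0, 0.5, 0.1]

for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

At the low temperature nearly all of the probability mass sits on the first token; at the high temperature the four probabilities move closer together.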

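- `max_tokens` - caps the number of tokens the model may generate in its response; it does not change the size of the model's context window
  - The context window (fixed per model) has to hold:
    - System Prompt
    - Chat History
    - User Prompt
    - Generated output (at most `max_tokens` tokens)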
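
The prompt and the generated output share one context window, so the room left for output is the window size minus the prompt tokens. The sketch below only illustrates that arithmetic; the function name and all token counts are hypothetical, and providers differ in whether they clamp or reject a request that asks for more than fits.

```python
def completion_budget(context_window, prompt_tokens, requested_max_tokens):
    """Return how many tokens can actually be generated for one request."""
    remaining = context_window - prompt_tokens
    if remaining <= 0:
        raise ValueError("the prompt alone already fills the context window")
    # The output is limited by both the caller's cap and the space left.
    return min(requested_max_tokens, remaining)


# Hypothetical token counts (illustrative values only).
context_window = 8192
system_prompt, chat_history, user_prompt = 200, 3000, 150
prompt_tokens = system_prompt + chat_history + user_prompt

print(completion_budget(context_window, prompt_tokens, 1024))
print(completion_budget(context_window, prompt_tokens, 6000))
```

The first call is limited by the requested cap, the second by the space remaining in the window.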
  
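- `top_p` (nucleus sampling) - restricts sampling to the most probable tokens
  - Keeps the smallest set of tokens whose cumulative probability reaches P, and samples only from that set
  - Not the same as `top_k`, which keeps a fixed number K of the most probable tokens
  - Both truncate the low-probability tail of the distribution; `top_k` = 1 is equivalent to greedy decoding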
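
A small sketch of how the two truncation strategies differ: top-k keeps a fixed number of tokens, while top-p keeps as many tokens as it takes for their cumulative probability to reach the threshold. The helper names and the probabilities are invented for illustration and are not taken from any particular library.

```python
def top_k_filter(probs, k):
    """Keep the k most probable token indices and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = order[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


# Hypothetical next-token probabilities (illustrative values only).
probs = [0.5, 0.3, 0.15, 0.05]

print(sorted(top_k_filter(probs, 2)))    # indices kept by top-k with k=2
print(sorted(top_p_filter(probs, 0.9)))  # indices kept by top-p with p=0.9
```

With these numbers top-k with k=2 keeps two tokens, while top-p with p=0.9 needs three tokens to reach the threshold. The model then samples from the renormalized distribution over the kept tokens.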


## Additional References

Hugging Face has a wonderful [blog post](https://huggingface.co/blog/how-to-generate) further explaining these decoding parameters. See also the documentation for [OpenAI's API](https://platform.openai.com/docs/api-reference/chat/create) and [Together AI's API](https://docs.together.ai/docs/inference-parameters).