# Lesson 0: Setup and Data Preparation

Welcome to the **MiniLlama2-Tutorial**! This notebook sets up your environment and prepares the dataset for building a smaller version of the Llama 2 LLM. We'll install the necessary packages and load the first 100 rows of Traditional Chinese (Hong Kong) text from the Lihkg dataset.

## Objectives
- Set up the environment (Google Colab or local venv).
- Install required packages (PyTorch, datasets, etc.).
- Download and save the first 100 rows of Lihkg data.

## Prerequisites
- Google Colab (recommended) or a local Python environment (3.8+).
- Internet access to download packages and data.

Let's get started!

In [None]:
# Check Python version
!python --version
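
The same check can be made from inside Python, which is handy when the shell command and the notebook kernel point at different interpreters. This is a minimal sketch; the helper name `python_is_supported` is illustrative, and the 3.8 minimum is taken from the Prerequisites above.

```python
import sys

MIN_PYTHON = (3, 8)  # minimum version stated in the Prerequisites


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the given interpreter version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


# Report the version of the interpreter that is actually running this cell.
print("Python", sys.version.split()[0], "supported:", python_is_supported())
```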

## Step 1: Install Required Packages

We'll install the necessary Python packages. On Google Colab, some packages such as PyTorch may already be pre-installed; the command below installs any that are missing. If you're using a local environment, the same command works inside a venv.

Run the cell below to install:
- `torch`: PyTorch for model building.
- `datasets`: Hugging Face library to load Lihkg data.
- `pandas`: For data manipulation.
- `numpy`: For numerical operations.
- `matplotlib`: For visualizations in later lessons.
- `flask`: For deployment in Lesson 8.

In [None]:
# Install packages
!pip install torch datasets pandas numpy matplotlib flask

# Author's local installation (NVIDIA RTX 4070 GPU, CUDA 11.8 wheels):
# pip install datasets pandas numpy matplotlib flask ipykernel
# pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
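
After installing a CUDA build of PyTorch, it can be useful to confirm that the GPU is actually visible. The sketch below is one way to do that; the helper name `describe_torch_device` is illustrative, and the function falls back gracefully when PyTorch is missing or no GPU is present.

```python
def describe_torch_device():
    """Report which device PyTorch can use, without failing if torch is absent."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if torch.cuda.is_available():
        # Name of the first CUDA device visible to PyTorch.
        return "cuda: " + torch.cuda.get_device_name(0)
    return "cpu"


print(describe_torch_device())
```

A result of `cpu` on a machine with an NVIDIA GPU usually means a CPU-only wheel was installed or the driver is not visible to PyTorch.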

## Step 2: Verify Installation

Let's verify that the packages are installed correctly by importing them.

In [2]:
# Import and verify
import torch
import datasets
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import flask
from importlib.metadata import version  # flask.__version__ is deprecated

print("PyTorch version:", torch.__version__)
print("Datasets version:", datasets.__version__)
print("Pandas version:", pd.__version__)
print("NumPy version:", np.__version__)
print("Flask version:", version("flask"))
print("All packages installed successfully!")

PyTorch version: 2.6.0+cu118
Datasets version: 3.3.2
Pandas version: 2.2.3
NumPy version: 2.2.3
Flask version: 3.1.0
All packages installed successfully!
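
If one of the imports fails, it helps to see at a glance which packages are missing. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_versions` is illustrative, and the package list mirrors the install command in Step 1.

```python
from importlib import metadata

REQUIRED = ["torch", "datasets", "pandas", "numpy", "matplotlib", "flask"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Print one line per package so missing ones stand out.
for name, found in installed_versions(REQUIRED).items():
    print(f"{name}: {found or 'NOT INSTALLED'}")
```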


## Step 3: Optional - Set Up Virtual Environment (Local Only)

If you're working locally rather than in Google Colab, set up a virtual environment from your terminal:

1. Create a virtual environment:
   ```bash
   python -m venv minillama2_env
   ```
2. Activate it:
   - Windows: `minillama2_env\Scripts\activate`
   - Mac/Linux: `source minillama2_env/bin/activate`
3. Install packages (run in terminal after activation):
   ```bash
   pip install torch datasets pandas numpy matplotlib flask
   ```

For Colab users, skip this step as the environment is already set up.
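
To confirm that the notebook kernel is really running inside the activated environment, you can compare the interpreter's prefixes. This is a small sketch, assuming the environment was created with `python -m venv` as above; the helper name `in_virtualenv` is illustrative.

```python
import sys


def in_virtualenv():
    """True when the interpreter runs inside a venv (prefix differs from base)."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("Virtual environment active:", in_virtualenv())
print("Interpreter:", sys.executable)
```

If this prints `False` in a local notebook, the kernel is probably using the system interpreter rather than `minillama2_env`.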

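## Step 4: Retrieve Data

Load the Lihkg data from the local CSV file, keep the first 100 rows, and preview the result.

In [4]:
# Load the Lihkg CSV into a pandas DataFrame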
df = pd.read_csv('data/data.csv')

# Select the first 100 rows
df_sample = df.head(100)

# Display the first few rows
df_sample.head()

| Unnamed: 0 | user | title | head |
|---|---|---|---|
| 0 | 慘過番印度 | 是靚午 法國紅酒慢煮阿根廷牛舌 配 煙肉洋蔥炒著仔 | 法國紅酒慢煮阿根廷牛舌 配 煙肉洋蔥炒著仔#wail#pig\n（$60-5）#wail#p... |
| 1 | 慘過番印度 | 是靚午 仙台風燒牛舌定食 | 仙台風燒牛舌定食#wail#pig\n（$63）#wail#pig\n\n講吓味道先#wai... |
| 2 | 慘過番印度 | 衰妹愈大愈有女人味... | 是#wail#pig |
| 3 | 慘過番印度 | 是靚午 秦式三餸飯 | 秦式三餸飯#wail#pig\n（$47）#wail#pig\n\n講吓味道先#wail#p... |
| 4 | 慘過番印度 | 想將雯雯啲ig story整合做一本書出版 | 是#wail#pig |
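
The cell above keeps the sample in memory only. To reuse it in later lessons, the sample can be written back to disk. The sketch below is one possible approach, not the tutorial's confirmed method: the helper name `save_sample` is illustrative, and passing `index_col=0` assumes the leading `Unnamed: 0` column is a saved row index.

```python
from pathlib import Path

import pandas as pd


def save_sample(src, dst, n_rows=100):
    """Read the first n_rows of src and write them to dst as a UTF-8 CSV."""
    # index_col=0 treats the leading unnamed column as the row index
    # (an assumption about how the source CSV was written).
    sample = pd.read_csv(src, index_col=0, nrows=n_rows)
    dst = Path(dst)
    dst.parent.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    # Writing the index keeps the output in the same layout as the input file.
    sample.to_csv(dst, encoding="utf-8")
    return sample
```

For example, `save_sample("data/data.csv", "data/lihkg_sample.csv")` would write the 100-row sample next to the source file; the output filename here is taken from the troubleshooting note at the end of this lesson and may need adjusting.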


## Next Steps

You're all set! Proceed to `Lesson1-Introduction_to_LM_and_Transformers.ipynb` to start learning about language models and transformers.

If you encounter issues:
- Ensure internet connectivity for package and data downloads.
- Check Colab runtime (GPU optional but helpful for training).
- Verify that the CSV file read in Step 4 (`data/data.csv`) exists in the `data/` folder.
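
The file check can also be scripted. This is a minimal sketch; the helper name `check_data_file` is illustrative, and the default path is the one read in Step 4, so adjust it if your data lives elsewhere.

```python
from pathlib import Path


def check_data_file(path="data/data.csv"):
    """Return a short status message describing whether the data file is usable."""
    path = Path(path)
    if not path.is_file():
        return f"Missing: {path.resolve()}"
    size = path.stat().st_size
    if size == 0:
        return f"Empty file: {path}"
    return f"OK: {path} ({size} bytes)"


print(check_data_file())
```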