This repository is a personal, educational rewrite in pure C++17 of Andrej Karpathy's Python tutorial "Let's build GPT: from scratch, in code, spelled out".
The primary objective of this project is not to create a competitive machine learning framework, but rather to use standard C++ to deeply study what happens behind the scenes of high-level deep learning libraries. By manually translating the Python pipeline into static C++ types, I am practising low-level memory handling, data streaming, and the foundational mathematics of Transformer architectures.
- Data Downloader (Completed): A native implementation using libcurl and std::filesystem to automatically download and manage the training corpus (TinyShakespeare) inside a localised directory structure.
- Tokeniser (Planned): A character-level map using standard library containers to parse input files into structured integer tokens.
- Math Operations (Planned): Pure C++ implementations of multi-dimensional matrix operations, array strides, and core activation layers such as Softmax and GeLU.
- Autograd and Optimisation (Planned): Manual calculation of gradients and backward passes to practise the core underlying calculus of neural network training.
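Since the tokeniser and the math operations are still planned, the following is only a minimal sketch of what two of those pieces might look like, not the repository's implementation. The names CharTokeniser, softmax, and gelu are hypothetical, the vocabulary order (sorted by byte value) is an assumption, and the GeLU shown is the common tanh approximation rather than the exact erf form.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical character-level tokeniser: each distinct byte in the corpus
// gets an integer id, assigned in sorted byte order (an assumption).
class CharTokeniser {
public:
    explicit CharTokeniser(const std::string& corpus) {
        for (char c : corpus) {
            char_to_id_.emplace(c, 0);  // collect the distinct characters
        }
        int next_id = 0;
        for (auto& [c, id] : char_to_id_) {  // std::map iterates in key order
            id = next_id++;
            id_to_char_.push_back(c);
        }
    }

    // Maps text to ids; throws std::out_of_range for characters not in the corpus.
    std::vector<int> encode(const std::string& text) const {
        std::vector<int> ids;
        ids.reserve(text.size());
        for (char c : text) {
            ids.push_back(char_to_id_.at(c));
        }
        return ids;
    }

    // Maps ids back to text; throws std::out_of_range for unknown ids.
    std::string decode(const std::vector<int>& ids) const {
        std::string text;
        text.reserve(ids.size());
        for (int id : ids) {
            text.push_back(id_to_char_.at(static_cast<std::size_t>(id)));
        }
        return text;
    }

    std::size_t vocab_size() const { return id_to_char_.size(); }

private:
    std::map<char, int> char_to_id_;
    std::vector<char> id_to_char_;
};

// Numerically stable softmax: subtracting the maximum logit before
// exponentiating avoids overflow and leaves the result unchanged.
std::vector<double> softmax(const std::vector<double>& logits) {
    assert(!logits.empty());
    const double max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<double> probs(logits.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp(logits[i] - max_logit);
        sum += probs[i];
    }
    for (double& p : probs) {
        p /= sum;
    }
    return probs;
}

// GeLU, tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 x^3))).
double gelu(double x) {
    const double sqrt_2_over_pi = 0.7978845608028654;
    return 0.5 * x * (1.0 + std::tanh(sqrt_2_over_pi * (x + 0.044715 * x * x * x)));
}
```

With this sketch, a tokeniser built from the string "hello" has a vocabulary of four characters (e, h, l, o), and decode(encode(text)) returns the original text for any text drawn from that vocabulary. A real implementation would likely operate on flat buffers with explicit strides instead of std::vector<double> per row.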
├── CMakeLists.txt # Build system configuration
├── main.cpp # Project entry point and pipeline coordination
├── download.hpp # Class declaration for the libcurl downloader
├── download.cpp # Downloader implementation details
├── train.cpp # Training loop logic stub
└── .gitignore # Keeps build files and local datasets untracked
To compile and run this project, you will need a compiler that supports C++17 (or higher) and the libcurl development package.
On Debian or Ubuntu:
sudo apt update
sudo apt install build-essential cmake libcurl4-openssl-dev
On macOS with Homebrew:
brew install cmake curl
On other systems, ensure CMake is installed and configured. You can use a package manager like vcpkg to obtain curl, or link pre-compiled binaries manually via your toolchain configuration.
- Clone the repository:
git clone https://github.com/localopensource/nanoGPT-Study.cpp.git
cd nanoGPT-Study.cpp
- Generate the build system and compile:
mkdir build
cd build
cmake ..
cmake --build .
- Execute the programme:
./nanogpt
Note: On its first run, the application automatically creates a datasets/ folder and streams the raw TinyShakespeare text file into it using libcurl.
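The first-run check described in the note can be sketched with std::filesystem alone. This is an illustration under stated assumptions, not the code in download.cpp: the helper name prepare_dataset_dir and the file name passed to it are hypothetical, and the libcurl transfer itself is deliberately left out.

```cpp
#include <cassert>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: makes sure the dataset directory exists and reports
// whether the corpus still has to be fetched.
// Returns true when a download is needed (file missing or empty).
bool prepare_dataset_dir(const fs::path& dir, const std::string& filename) {
    assert(!filename.empty());
    fs::create_directories(dir);  // no-op if the directory already exists
    const fs::path target = dir / filename;
    return !fs::is_regular_file(target) || fs::file_size(target) == 0;
}
```

A caller might invoke prepare_dataset_dir("datasets", "input.txt") at start-up (the file name here is an assumption) and start the libcurl download only when it returns true, so later runs reuse the local copy.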
This project is strictly a tool for self-education and relies heavily on the instructional materials provided by the open-source community:
- Inspiration: Andrej Karpathy's nanoGPT and char-rnn repositories.
- Dataset: TinyShakespeare
This study repository adopts the same permissive framework as the original projects. It is open-sourced under the MIT Licence -- see the accompanying LICENSE file for details.