<a href="https://colab.research.google.com/github/obeshor/Data-Science-AI-Assistant/blob/main/Gemma_cpp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:

# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES
# TO THE CORRECT LOCATION (/kaggle/input) IN YOUR NOTEBOOK,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

import os
import sys
from tempfile import NamedTemporaryFile
from urllib.request import urlopen
from urllib.parse import unquote, urlparse
from urllib.error import HTTPError
from zipfile import ZipFile
import tarfile
import shutil

CHUNK_SIZE = 40960
DATA_SOURCE_MAPPING = 'gemma/gemmacpp/2b-it-sfp/1:https%3A%2F%2Fstorage.googleapis.com%2Fkaggle-models-data%2F8385%2F10417%2Fbundle%2Farchive.tar.gz%3FX-Goog-Algorithm%3DGOOG4-RSA-SHA256%26X-Goog-Credential%3Dgcp-kaggle-com%2540kaggle-161607.iam.gserviceaccount.com%252F20240325%252Fauto%252Fstorage%252Fgoog4_request%26X-Goog-Date%3D20240325T182437Z%26X-Goog-Expires%3D259200%26X-Goog-SignedHeaders%3Dhost%26X-Goog-Signature%3D1a57440d482394a52d58b29ecdadc2d829fa22334361c699295f3f40995e2cb7d4221d6b0fae8784f9481c765ff8ef2a60d9ec63dfc1670c9fcc81b31f175bccde147f663be197200d285c8877d23af50558d17f812dd88f9923628f5efa2b2edb7ce6f6697128c096ca68566493b2fda9ce3866aa9c9695a0bf587884aa83c881b58f56f39535fa859b83e5bdc336112959ae1c45f08832b2a2acc0d71d70f889723e09ed4aeccb13ec0a5ee9bbea564d3a84c6901acf9c575b804be7e115ae69f44c7da9bf42f2deedec618a2956f57be4f8d2622b42221546c58e7ab01aa54123bc3b1b528bfbbb90bb97fa58b83499b0ba0def05f84caea5dc0aa2062754'

KAGGLE_INPUT_PATH='/kaggle/input'
KAGGLE_WORKING_PATH='/kaggle/working'
KAGGLE_SYMLINK='kaggle'

!umount /kaggle/input/ 2> /dev/null
shutil.rmtree('/kaggle/input', ignore_errors=True)
os.makedirs(KAGGLE_INPUT_PATH, 0o777, exist_ok=True)
os.makedirs(KAGGLE_WORKING_PATH, 0o777, exist_ok=True)

try:
  os.symlink(KAGGLE_INPUT_PATH, os.path.join("..", 'input'), target_is_directory=True)
except FileExistsError:
  pass
try:
  os.symlink(KAGGLE_WORKING_PATH, os.path.join("..", 'working'), target_is_directory=True)
except FileExistsError:
  pass

for data_source_mapping in DATA_SOURCE_MAPPING.split(','):
    directory, download_url_encoded = data_source_mapping.split(':')
    download_url = unquote(download_url_encoded)
    filename = urlparse(download_url).path
    destination_path = os.path.join(KAGGLE_INPUT_PATH, directory)
    try:
        with urlopen(download_url) as fileres, NamedTemporaryFile() as tfile:
            total_length = fileres.headers['content-length']
            print(f'Downloading {directory}, {total_length} bytes compressed')
            dl = 0
            data = fileres.read(CHUNK_SIZE)
            while len(data) > 0:
                dl += len(data)
                tfile.write(data)
                done = int(50 * dl / int(total_length))
                sys.stdout.write(f"\r[{'=' * done}{' ' * (50-done)}] {dl} bytes downloaded")
                sys.stdout.flush()
                data = fileres.read(CHUNK_SIZE)
            if filename.endswith('.zip'):
              with ZipFile(tfile) as zfile:
                zfile.extractall(destination_path)
            else:
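              # Flush buffered bytes so the archive is complete when reopened by name
              tfile.flush()
              # Use a distinct name so the `tarfile` module is not shadowed on later iterations
              with tarfile.open(tfile.name) as tar_archive:
                tar_archive.extractall(destination_path)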
            print(f'\nDownloaded and uncompressed: {directory}')
    except HTTPError as e:
        print(f'Failed to load (likely expired) {download_url} to path {destination_path}')
        continue
    except OSError as e:
        print(f'Failed to load {download_url} to path {destination_path}')
        continue

print('Data source import complete.')


The following bash script (written to disk by the `%%writefile` cell magic) automates the installation and setup of the Gemma C++ executable. Each command does the following:

1. `echo "/usr/local/lib" | tee -a /etc/ld.so.conf`: Appends the directory `/usr/local/lib` to the file `/etc/ld.so.conf`, so that the dynamic linker also searches that directory for shared libraries.

2. `ldconfig -v`: This command updates the linker cache, allowing it to recognize the new library configurations. The `-v` flag provides verbose output.

3. `conda install cmake -y`: Installs the `cmake` package using the `conda` package manager, with the `-y` flag for automatic confirmation.

4. `conda install -c conda-forge sentencepiece -y`: Installs the `sentencepiece` package from the `conda-forge` channel using `conda`, again with the `-y` flag for automatic confirmation.

5. `git clone https://github.com/google/gemma.cpp`: Clones the repository for the `gemma.cpp` project from GitHub.

6. `cd "gemma.cpp/build"`: Changes the current directory to `gemma.cpp/build`.

7. `cmake ..`: Runs CMake to configure the build system using the `CMakeLists.txt` file in the parent directory (`..`).

8. `make -j 4 gemma`: Starts the build with `make`. The `-j 4` flag runs four compile jobs in parallel (matching the four cores of a Kaggle Notebook), and the `gemma` argument restricts the build to the `gemma` target.

9. `cd ../..`: Moves two directories up from the current location.

10. `cp -r gemma.cpp/build ./gemma_cpp`: Copies the `build` directory (containing the compiled `gemma` binary) to a directory named `gemma_cpp` in the current working directory.
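
Step 8 hard-codes four jobs to match the Kaggle machine. On other hosts the job count could be derived from the cores Python reports. A minimal sketch of that idea; `make_command` is a hypothetical helper introduced for illustration and is not part of gemma.cpp:

```python
import os

def make_command(target="gemma", jobs=None):
    """Return the `make` invocation as an argument list.

    Hypothetical helper: when `jobs` is not given, fall back to the
    number of cores reported by os.cpu_count().
    """
    if jobs is None:
        jobs = os.cpu_count() or 1  # cpu_count() may return None
    return ["make", "-j", str(jobs), target]

print(make_command(jobs=4))  # ['make', '-j', '4', 'gemma']
```

The list form can be passed straight to `subprocess.run` from the build directory if the build is driven from Python rather than from the bash script.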


In [None]:
%%writefile gemma_cpp.sh

echo "/usr/local/lib" | tee -a /etc/ld.so.conf
ldconfig -v

conda install cmake -y
conda install -c conda-forge sentencepiece -y

git clone https://github.com/google/gemma.cpp
cd "gemma.cpp/build"
cmake ..
make -j 4 gemma
cd ../..
cp -r gemma.cpp/build ./gemma_cpp

Overwriting gemma_cpp.sh


In [None]:
!bash ./gemma_cpp.sh

/usr/local/lib
/sbin/ldconfig.real: Can't stat /usr/local/lib/x86_64-linux-gnu: No such file or directory
/sbin/ldconfig.real: Path `/usr/lib/x86_64-linux-gnu' given more than once
/sbin/ldconfig.real: Path `/usr/local/lib' given more than once
/sbin/ldconfig.real: Path `/usr/local/lib' given more than once
/sbin/ldconfig.real: Path `/lib/x86_64-linux-gnu' given more than once
/sbin/ldconfig.real: Path `/usr/lib/x86_64-linux-gnu' given more than once
/sbin/ldconfig.real: Path `/usr/lib' given more than once
/usr/local/lib:
/lib/x86_64-linux-gnu:
	libpipeline.so.1 -> libpipeline.so.1.5.2
	libjbig.so.0 -> libjbig.so.0
	libwebp.so.6 -> libwebp.so.6.0.2
	libhdf5_serial.so.103 -> libhdf5_serial.so.103.0.0
	libxcb.so.1 -> libxcb.so.1.1.0
	libnpth.so.0 -> libnpth.so.0.1.2
	libhdf5_hl_cpp.so.100 -> libhdf5_hl_cpp.so.100.1.2
	libbfd-2.34-system.so -> libbfd-2.34-system.so
	libelf.so.1 -> libelf-0.176.so
	libasound.so.2 -> libasound.so.2.0.0
	librest-0.7.so.0 -> librest-0.7.so.0.0.0
	liblqr-1.so

In [None]:
!rm -rf "gemma.cpp"
!rm -f "gemma_cpp.sh"
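
Before invoking the binary, it can help to confirm that the executable and the model files are where the next cell expects them. A minimal sketch; `missing_paths` is a hypothetical helper, and the paths are the ones this notebook passes to `gemma`:

```python
import os

def missing_paths(paths):
    """Return the subset of `paths` that does not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]

# Paths taken from the command used in this notebook
required = [
    "./gemma_cpp/gemma",
    "/kaggle/input/gemma/gemmacpp/2b-it-sfp/1/tokenizer.spm",
    "/kaggle/input/gemma/gemmacpp/2b-it-sfp/1/2b-it-sfp.sbs",
]
for path in missing_paths(required):
    print(f"Missing: {path}")
```

If nothing is printed, all three files were found. A missing weights or tokenizer file usually points to an expired download URL in the import cell.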

In [None]:
!echo "Can you explain linear regression?" | ./gemma_cpp/gemma -- --tokenizer /kaggle/input/gemma/gemmacpp/2b-it-sfp/1/tokenizer.spm --compressed_weights /kaggle/input/gemma/gemmacpp/2b-it-sfp/1/2b-it-sfp.sbs --model 2b-it --verbosity 0


[ Reading prompt ] ..............
Sure, here's a breakdown of linear regression:

**What is it?**

Linear regression is a statistical method used to find a straight line that best fits a set of data points. It's a powerful tool for understanding relationships between variables and making predictions based on new data points.

**How does it work?**

1. **Data preparation:** You start by collecting data points, which can be represented in a table or a graph.
2. **Forming the model:** You then create a mathematical equation that expresses the relationship between the dependent and independent variables.
3. **Fitting the model:** Using a statistical software package, you find the values of the coefficients that minimize the difference between the actual data points and the predicted values based on the model.
4. **Interpretation:** The coefficients tell you how much each independent variable affects the dependent variable.
5. **Prediction:** Once you have the model, you can use it to pred

In [None]:
import subprocess
import sys

class GemmaCPP():
    """Wrapper for the C++ implementation of Gemma"""

    def __init__(self, gemma_cpp, tokenizer, compressed_weights, model):
        self.gemma_cpp = gemma_cpp
        self.tokenizer = tokenizer
        self.compressed_weights = compressed_weights
        self.model = model

    def generate_text(self, prompt):
        """Generate text using the cpp tokenizer and model"""
        # Escape double quotes so the prompt survives the double-quoted echo argument
        prompt = prompt.replace('"', '\\"')
        # Build the shell command from the paths stored on this instance
        shell_command = f'echo "{prompt}" | {self.gemma_cpp} -- --tokenizer {self.tokenizer} --compressed_weights {self.compressed_weights} --model {self.model} --verbosity 0'

        # Execute the shell command, merging stderr into the captured stdout;
        # decoding in text mode avoids errors on multi-byte characters
        process = subprocess.Popen(shell_command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                                   encoding="utf-8", errors="replace")

        # Print the output character by character as it is generated
        for char in iter(lambda: process.stdout.read(1), ''):
            sys.stdout.write(char)
            sys.stdout.flush()

        # Wait for the process to finish
        process.wait()
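
Because the wrapper builds a shell string, characters such as `$`, backticks, or quotes in the prompt are interpreted by the shell. One alternative is to skip the shell and write the prompt to the process's standard input directly. A minimal sketch under that assumption; `gemma_args` and `run_with_prompt` are hypothetical names, and the flags are copied from the command used in this notebook:

```python
import subprocess

def gemma_args(gemma_cpp, tokenizer, compressed_weights, model):
    """Assemble the argument list, mirroring the flags used in this notebook."""
    return [
        gemma_cpp, "--",
        "--tokenizer", tokenizer,
        "--compressed_weights", compressed_weights,
        "--model", model,
        "--verbosity", "0",
    ]

def run_with_prompt(args, prompt):
    """Run `args` without a shell, feed `prompt` on stdin, return the output."""
    result = subprocess.run(
        args,
        input=prompt + "\n",       # same trailing newline that `echo` adds
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the returned text
        encoding="utf-8",
        errors="replace",
    )
    return result.stdout
```

Unlike the streaming loop in `generate_text`, this sketch returns the whole response only after the process exits, which trades live output for simpler and safer argument handling.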

In [None]:
gemma_cpp = "./gemma_cpp/gemma"
tokenizer = "/kaggle/input/gemma/gemmacpp/2b-it-sfp/1/tokenizer.spm"
compressed_weights = "/kaggle/input/gemma/gemmacpp/2b-it-sfp/1/2b-it-sfp.sbs"
model = "2b-it"

gemma =  GemmaCPP(gemma_cpp, tokenizer, compressed_weights, model)

In [None]:
%%time
gemma.generate_text("Explain why the sky is blue")


[ Reading prompt ] ..............
The sky appears blue due to Rayleigh scattering. Rayleigh scattering is the scattering of light by particles of a smaller wavelength, such as blue and violet light. This is because blue and violet light have shorter wavelengths than other colors of light.

When sunlight enters the atmosphere, it is scattered in all directions. However, blue and violet light have shorter wavelengths and are scattered more strongly than longer wavelengths. This is because blue and violet light have more energy and are more likely to be scattered.

As a result, the sky appears blue to us. The blue color of the sky is also responsible for the beautiful colors of the rainbow.

CPU times: user 780 ms, sys: 138 ms, total: 919 ms
Wall time: 57.8 s
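
The gap between CPU time and wall time here is most likely because `%%time` reports the CPU time of the notebook's own Python process, while the generation itself runs in a child process. A sketch of measuring both for a child process, assuming a POSIX system (the child fields of `os.times()` are zero on Windows); `timed_run` is a hypothetical helper:

```python
import os
import subprocess
import sys
import time

def timed_run(args):
    """Run `args` and return (wall_seconds, child_cpu_seconds)."""
    before = os.times()
    start = time.perf_counter()
    subprocess.run(args, check=True)
    wall = time.perf_counter() - start
    after = os.times()
    # CPU time consumed by child processes that have been waited for
    child_cpu = (after.children_user - before.children_user) + (
        after.children_system - before.children_system
    )
    return wall, child_cpu

# Stand-in workload; replace with the gemma argument list to time generation
wall, child_cpu = timed_run([sys.executable, "-c", "sum(range(10**6))"])
print(f"wall: {wall:.2f} s, child CPU: {child_cpu:.2f} s")
```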


In [None]:
%%time
gemma.generate_text("I am working with the Titanic dataset, \
containing the features: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked \
suggest some feature engineering")


[ Reading prompt ] ............................................. engineeringtechniques to improve the predictive power of the model.

1. **One-Hot Encoding for categorical variables**:
   - Create new features by creating a new column for each categorical variable, with a value of 1 for the instance and 0 for the rest.

2. **Feature Scaling**:
   - Scale the numerical features (Age, SibSp, Parch, Fare) to a range between 0 and 1.

3. **Feature Transformation**:
   - Apply a logarithmic transformation to the "Age" feature.
   - Apply a square root transformation to the "Fare" feature.

4. **Creating new features**:
   - Create new features by combining existing features, such as "Sex_Male" and "Sex_Female".

5. **Using Regularization Techniques**:
   - Apply L1 regularization to penalize large coefficients and encourage sparsity in the model.

6. **Using Feature Importance**:
   - Calculate feature importance to understand which features contribute most to the model's performance.

7. 

In [None]:
%%time
gemma.generate_text("Show me how to use in Python a XGBoost classifier")


[ Reading prompt ] ............ classifier.

```python
import xgboost as xgb

# Load the XGBoost library
XGB = xgboost.XGBRegressor()

# Define the training data
X_train = ...
y_train = ...

# Train the XGBoost classifier
XGB.fit(X_train, y_train)

# Make predictions on new data
X_test = ...
y_pred = XGB.predict(X_test)
```

**Example:**

```python
import xgboost as xgb

# Load the XGBoost library
XGB = xgboost.XGBRegressor()

# Load the training data
X_train = xgb.datasets.load_data('xgb_data.csv')
y_train = xgb.datasets.load_data('xgb_target.csv')

# Train the XGBoost classifier
XGB.fit(X_train, y_train)

# Make predictions on new data
X_test = xgb.datasets.load_data('xgb_test.csv')
y_pred = XGB.predict(X_test)

# Print the results
print('Predicted labels:', y_pred)
```

**Notes:**

* Replace `xgb_data.csv` and `xgb_target.csv` with the actual file names of your training and target data.
* You can adjust the hyperparameters of the XGBoost classifier, such as the learning rate and nu

In [None]:
%%time
gemma.generate_text("What are the key differences between gradient boosting and random forests?")