## Exercise 1: Analyze Book Compression
- **Import the GPT-4 tokenizer.**
- Import several books using their URLs.
- Calculate **compression ratio** as `tokens / characters`.
- Display the results in a **table format** for easy comparison.

In [25]:
# Import the libs

import numpy as np
import requests
import tiktoken
from urllib.parse import urlparse
import string

In [2]:
gbt4_tokenizer = tiktoken.get_encoding("cl100k_base")
gbt4_tokenizer.n_vocab

100277

In [3]:
# all books share the same URL format;
# each one is identified by a numeric code
baseurl = 'https://www.gutenberg.org/cache/epub/'

bookurls = [
    # code       title
    ['84',    'Frankenstein'    ],
    ['64317', 'GreatGatsby'     ],
    ['11',    'AliceWonderland' ],
    ['1513',  'RomeoJuliet'     ],
    ['76',    'HuckFinn'        ],
    ['219',   'HeartDarkness'   ],
    ['2591',  'GrimmsTales'     ],
    ['2148',  'EdgarAllenPoe'   ],
    ['36',    'WarOfTheWorlds'  ],
    ['829',   'GulliversTravels']
]

In [4]:
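# one row per book; columns: characters, tokens, compression (%)
infos = np.zeros( (len(bookurls),3) )

for idx,(c,b) in enumerate(bookurls):

    # download the book text and tokenize it
    text = ( requests.get(baseurl+c+"/pg"+c+".txt") ).text
    tokens = gbt4_tokenizer.encode(text)

    # compression ratio: tokens per character
    compression = len(tokens)/len(text)

    infos[idx,0] = len(text)
    infos[idx,1] = len(tokens)
    infos[idx,2] = compression*100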

In [5]:
print("| Book Title       | Characters  | Tokens    | Compression |\n"+"-"*60)

# loop over all books (no trailing newline, so the rows stay contiguous)
for i in range(len(bookurls)):
    print(f"| {bookurls[i][1]:16} | {int(infos[i,0]):<11,} | {int(infos[i,1]):<8,}  | {infos[i,2]:>10.2f}% |")

| Book Title       | Characters  | Tokens    | Compression |
------------------------------------------------------------
| Frankenstein     | 446,544     | 102,419   |      22.94% |
| GreatGatsby      | 296,858     | 70,343    |      23.70% |
| AliceWonderland  | 167,674     | 41,457    |      24.72% |
| RomeoJuliet      | 167,426     | 43,761    |      26.14% |
| HuckFinn         | 602,714     | 159,125   |      26.40% |
| HeartDarkness    | 232,885     | 56,483    |      24.25% |
| GrimmsTales      | 549,736     | 137,252   |      24.97% |
| EdgarAllenPoe    | 632,131     | 144,315   |      22.83% |
| WarOfTheWorlds   | 363,420     | 84,580    |      23.27% |
| GulliversTravels | 611,742     | 143,560   |      23.47% |
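
To make the last column concrete, here is a minimal sketch that recomputes the Frankenstein row from the counts in the table. The helper name `compression_percent` is hypothetical (it is not part of the notebook), and the character and token counts are copied from the table rather than recomputed from the book text.

```python
def compression_percent(num_tokens, num_chars):
    """Tokens per character, expressed as a percentage."""
    return 100 * num_tokens / num_chars

# Counts copied from the Frankenstein row of the table
frank_chars = 446_544
frank_tokens = 102_419

percent = compression_percent(frank_tokens, frank_chars)

# The inverse view: how many characters one token covers on average
chars_per_token = frank_chars / frank_tokens

print(f"{percent:.2f}% tokens per character")
print(f"{chars_per_token:.2f} characters per token")
```

A lower percentage means each token covers more characters: roughly 23% corresponds to a bit over four characters per token.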



## Exercise 2: Analyze Website Compression
- Use websites as input text data.
- **Tip:** Use `urllib.parse` to clean and process the URLs.
- Compute and display compression ratios for different websites in a table.
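
As a small illustration of the tip, the sketch below pulls the hostname out of a URL with `urlparse`, which gives a short label for each table row. The two URLs are taken from the list used in this exercise; the variable names are only illustrative.

```python
from urllib.parse import urlparse

urls = [
    'http://python.org/',
    'https://en.wikipedia.org/wiki/List_of_English_words_containing_Q_not_followed_by_U',
]

# .hostname drops the scheme, path, and any port, leaving just the host
labels = [urlparse(u).hostname for u in urls]

for url, label in zip(urls, labels):
    print(f"{label:22} <- {url}")
```

Using `.hostname` rather than `.netloc` also lowercases the host and leaves out any port or login information, which keeps the labels uniform.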

In [7]:
weburls = [
    'http://python.org/',
    'https://pytorch.org/',
    'https://en.wikipedia.org/wiki/List_of_English_words_containing_Q_not_followed_by_U',
    'https://sudoku.com/',
    'https://reddit.com/',
    'https://visiteurope.com/en/',
    'https://sincxpress.com/',
    'https://openai.com/',
    'https://theuselessweb.com/',
    'https://maps.google.com/',
    'https://pigeonsarentreal.co.uk/',
]

In [8]:
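# one row per website; columns: characters, tokens, compression (%)
web_infos = np.zeros( (len(weburls),3) )

for idx,web in enumerate(weburls):

    # download the page source and tokenize it
    text = requests.get(web).text
    tokens = gbt4_tokenizer.encode(text)

    # compression ratio: tokens per character
    compression = len(tokens)/len(text)

    web_infos[idx,0] = len(text)
    web_infos[idx,1] = len(tokens)
    web_infos[idx,2] = compression*100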

In [None]:
print("| Websites               | Characters  | Tokens    | Compression |\n"+"-"*67)

# loop over all websites (no trailing newline, so the rows stay contiguous)
for i in range(len(weburls)):
    print(f"| {urlparse(weburls[i]).hostname:22} | {int(web_infos[i,0]):<11,} | {int(web_infos[i,1]):<8,}  | {web_infos[i,2]:>10.2f}% |")

| Websites               | Characters  | Tokens    | Compression |
-------------------------------------------------------------------
| python.org             | 50,172      | 12,781    |      25.47% |
| pytorch.org            | 388,671     | 111,425   |      28.67% |
| en.wikipedia.org       | 92          | 26        |      28.26% |
| sudoku.com             | 145,997     | 52,589    |      36.02% |
| reddit.com             | 460,477     | 143,867   |      31.24% |
| visiteurope.com        | 124,950     | 34,344    |      27.49% |
| sincxpress.com         | 25,580      | 6,843     |      26.75% |
| openai.com             | 11,533      | 6,398     |      55.48% |
| theuselessweb.com      | 4,756       | 1,329     |      27.94% |
| maps.google.com        | 211,258     | 107,369   |      50.82% |
| pigeonsarentreal.co.uk | 243,863     | 71,232    |      29.21% |
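
One row deserves a second look: 92 characters is far shorter than a typical web page, which suggests (though the notebook does not confirm it) that the request for en.wikipedia.org returned a short error or placeholder message rather than the article. Below is a minimal sketch of a sanity check, using a few character counts copied from the table. The 1,000-character threshold and the name `flag_short_responses` are assumptions made for illustration.

```python
# (hostname, characters) pairs copied from the table
rows = [
    ('python.org', 50_172),
    ('en.wikipedia.org', 92),
    ('theuselessweb.com', 4_756),
    ('openai.com', 11_533),
]

def flag_short_responses(rows, min_chars=1_000):
    """Return hostnames whose response is shorter than min_chars characters."""
    return [host for host, num_chars in rows if num_chars < min_chars]

suspicious = flag_short_responses(rows)
print(suspicious)
```

In the download loop itself, checking the `status_code` of the `requests` response before tokenizing would catch failed requests more directly.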



## Exercise 3: Analyze String Library Attributes
- Use all attributes from the Python `string` library as the dataset.
- Calculate and display their compression ratios.
- **Tip:** Use `dir(string)` to access all string attributes.
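
The tip can be sketched as follows: `dir(string)` returns attribute names only, so `getattr` is needed to fetch each value, and values that are not strings (classes, functions) have to be filtered out. This is a minimal sketch without the tokenizer; the variable name `string_attrs` is only illustrative.

```python
import string

# keep only attributes whose value is a non-empty string
string_attrs = {
    name: getattr(string, name)
    for name in dir(string)
    if isinstance(getattr(string, name), str) and len(getattr(string, name)) > 0
}

for name, value in string_attrs.items():
    print(f"{name:15} {len(value):>4} characters")
```

`dir()` lists the names alphabetically, whereas iterating over `string.__dict__.items()` gives names and values together in definition order.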

In [None]:
# I got some help with this exercise (I should dig into OOP)

print("| Attribute       | Characters | Tokens | Compression |\n"+"-"*55)

for k,v in string.__dict__.items():
  if isinstance(v,str) and (len(v)>0):

    # get the text
    num_chars = len(v)

    # tokenize
    tokens = gbt4_tokenizer.encode(v)
    num_tokens = len(tokens)

    # compression ratio
    compress = 100*num_tokens/num_chars

    print(f'| {k:15} | {num_chars:>10,} | {num_tokens:>6,} |  {compress:>5.2f}%     |')

| Attribute       | Characters | Tokens | Compression |
-------------------------------------------------------
| __name__        |          6 |      1 |  16.67%     |
| __doc__         |        622 |    109 |  17.52%     |
| __file__        |         38 |     12 |  31.58%     |
| __cached__      |         63 |     22 |  34.92%     |
| whitespace      |          6 |      4 |  66.67%     |
| ascii_lowercase |         26 |      1 |   3.85%     |
| ascii_uppercase |         26 |      1 |   3.85%     |
| ascii_letters   |         52 |      2 |   3.85%     |
| digits          |         10 |      4 |  40.00%     |
| hexdigits       |         22 |      7 |  31.82%     |
| octdigits       |          8 |      3 |  37.50%     |
| punctuation     |         32 |     21 |  65.62%     |
| printable       |        100 |     31 |  31.00%     |
