In [1]:
%%capture
!pip install transformers
!pip install datasets
!pip install --upgrade sacrebleu sentencepiece
import torch
from torch.utils.data import Dataset, random_split
from transformers import AutoTokenizer, TrainingArguments, Trainer, AutoModelForCausalLM
import pandas as pd
from datetime import datetime

First, we scrape the data we are going to need to train the model. There are lots of different data sources you can use, I found this easy to follow tutorial from geeksforgeeks so we are gonna get our data from horoscope.com.
https://www.geeksforgeeks.org/how-to-check-horoscope-using-python/

In [2]:
#first, define the function that will scrape the url
import requests
from bs4 import BeautifulSoup 
  
def horoscope(zodiac_sign: int, day: str) -> str:
    url = (
        "https://www.horoscope.com/us/horoscopes/general/"
         f"horoscope-archive.aspx?sign={zodiac_sign}&laDate={day}"
    )
    # soup will contain all the website's data
    soup = BeautifulSoup(requests.get(url).content, 
                         "html.parser") 
      
    # we will search for main-horoscope
    # class and we will simply return it
    return soup.find("div", class_="main-horoscope").p.text

In [3]:
#We are going to scrape a database for each horoscope sign and date and we are gonna go back a few years to build up a large enough amount of data to make the training impactful
#example scrape
sign_map={'aries':1,'taurus':2,'gemini':3,
      'cancer':4,'leo':5,'virgo':6,'libra':7,
      'scorpio':8,'sagittarius':9,'capricorn':10,
      'aquarius':11,'pisces':12} 
horoscope_text = horoscope(sign_map["pisces"], 20230305) 
print(horoscope_text)


Mar 5, 2023 - Chances to pursue opportunities to bring whatever creative work you do best to the public could come up today, Pisces. This might involve performances, exhibitions, trade shows, or festivals - anything that involves a lot of attention from the public. You will be in the limelight and outshine almost everyone! This is likely to be a lot of fun. It should definitely boost your ego.


In [2]:
#let's create a list of dates going back 1000 days to grab data
datelist = pd.date_range(datetime.today(), periods=1000).tolist()
clean_datelist = []
#putting the dates in the right format for the url
for i in datelist:
  cleaned_date = i.strftime("%Y%m%d")
  clean_datelist.append(i)


In [5]:
#now grab the data for each of the horoscopes on each of the dates (this takes a while ~30min)
#initialize the empty dataframe to store the horoscope data
reverse_sign_map={1:'aries',2:'taurus',3:'gemini',
      4:'cancer',5:'leo',6:'virgo',7:'libra',
      8:'scorpio',9:'sagittarius',10:'capricorn',
      11:'aquarius',12:'pisces'} 

# create an Empty DataFrame object with the pandas library, a really great object that just stores our data for later use!
horoscope_df = pd.DataFrame(columns = ['Date', 'Sign', 'Fortune'])
for date in clean_datelist:
  for i in range(1,13):
    horoscope_text = horoscope(i, date).split(" - ")[1] #grab just the fortune and cut off the date prefix
    tmp_df = pd.DataFrame({'Date' : date, 'Sign' : reverse_sign_map[i], 'Fortune' : horoscope_text},index=[0])
    horoscope_df = pd.concat([horoscope_df, pd.DataFrame({'Date' : date, 'Sign' : reverse_sign_map[i], 'Fortune' : horoscope_text},index=[0])], ignore_index=True)



In [6]:
print(len(horoscope_df)) #double check the length is around 6,000


6000


In [7]:
#let's grab the averge length of each fortune for an approximation of the encoding length later on
horoscope_df["Fortune"].apply(len).mean()

376.1666666666667

You can even customize the data if you want! Or you can write your own dataset of costume fortunes! Just beware you are gonna have to make a lot of fortunes if you want to impactfully train the model later on!

In [9]:
#here I just make a superficial change of intellectual to smart in the dataset but you can play around and change anything!
horoscope_df.replace("intellectual", "smart") 


Unnamed: 0,Date,Sign,Fortune
0,2023-03-28 15:02:29.188347,aries,Feel free to upset the equilibrium in order to...
1,2023-03-28 15:02:29.188347,taurus,"Nothing is too hot for you to handle, but why ..."
2,2023-03-28 15:02:29.188347,gemini,Your physical energy is strong. Your desire fo...
3,2023-03-28 15:02:29.188347,cancer,Matters of the heart may not be going smoothly...
4,2023-03-28 15:02:29.188347,leo,"When it comes to love and romance, you're prob..."
...,...,...,...
5995,2024-08-08 15:02:29.188347,scorpio,"Bring more fire and passion to your love life,..."
5996,2024-08-08 15:02:29.188347,sagittarius,Recent conflicts may stir up some anger in you...
5997,2024-08-08 15:02:29.188347,capricorn,You may not be having the best of luck when it...
5998,2024-08-08 15:02:29.188347,aquarius,"Love and romance are in the air tonight, so fe..."


In [10]:
horoscope_df.to_csv("horoscope_data.csv") #write to csv so you don't have to scrape again!

In [2]:
horoscope_df = pd.read_csv("horoscope_data.csv")
print(len(horoscope_df))
horoscope_df = horoscope_df.loc[0:12000] #use this to select a smaller subsection of your data if your notebook runs out of memory!

12000


Now that we have our data, we are gonna pull the pretrained gpt2-medium and fine-tune it with our fortune telling data. We also pull the gpt2-medium pretrained tokenizer.

In [3]:
tokenizer = AutoTokenizer.from_pretrained("gpt2") #load the tokenizer for gpt2-medium
model = AutoModelForCausalLM.from_pretrained("gpt2")
model = model.cuda()

Downloading (…)lve/main/config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/548M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

In [4]:
tokenizer.pad_token = tokenizer.eos_token

Now we run our code to process our fortune dataset to it gives a fortune based on the date and sign of the person asking. 

In [5]:
class FortuneDataset(Dataset):
    def __init__(self, examples, tokenizer):
        self.input_ids = []
        self.attn_masks = []
        self.labels = []
        for data in examples.itertuples():
            date = str(data.Date).split(' ')[0]
            sign = data.Sign
            fortune = data.Fortune
            prompt = "Prompt: Yeong-sil, my sign is " + sign + " and today is " + date + ". What is the weather today? \nHere is your fortune: "
            training_text = prompt + fortune+"<|endoftext|>"

            encodings_dict = tokenizer(training_text, max_length=256, padding="max_length", truncation=True) #pads it out to 275
            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))

            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask'])) #1's for all tokens you want to train on then 0 for others
            prompt_len = len(tokenizer.encode(prompt))
            
            masked_labels = [-100]*prompt_len + encodings_dict['input_ids'][prompt_len:]
            self.labels.append(torch.tensor(masked_labels))


    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return {'input_ids':self.input_ids[idx], 'attention_mask':self.attn_masks[idx], 'labels':self.labels[idx]}

In [6]:
#split the data into a training set and a validation set
fortune_dataset = FortuneDataset(horoscope_df, tokenizer)
train_size = int(0.9 * len(fortune_dataset))
train_dataset, val_dataset = random_split(fortune_dataset, [train_size, len(fortune_dataset) - train_size])


In [7]:
from transformers import TrainingArguments
from transformers import DataCollatorForLanguageModeling

training_args = TrainingArguments(
  output_dir='./results',
  num_train_epochs=2
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

trainer.train()



Step,Training Loss
500,0.0969
1000,0.0005
1500,0.0002
2000,0.0002
2500,0.0001


TrainOutput(global_step=2700, training_loss=0.01813363520497525, metrics={'train_runtime': 1590.0992, 'train_samples_per_second': 13.584, 'train_steps_per_second': 1.698, 'total_flos': 2821953945600000.0, 'train_loss': 0.01813363520497525, 'epoch': 2.0})

Now test it out! Below I gave the model todays date and the leo sign, but feel free to edit those to your liking! What's cool is that slight spelling errors or formatting errors shouldn't throw the model off too much so feel free to play around and see what fortunes you can generate!

In [10]:
sign = "virgo"
date = 20230327
###edit the above to get different fortunes###

prompt = "Prompt: Yeong-sil, my sign is " + sign + " and today is " + str(date) + ". What is the weather today? \nHere is your fortune: "
# prompt_encoded = torch.tensor(tokenizer.encode(prompt)).cuda()
prompt_encoded = torch.tensor(tokenizer.encode(prompt)).unsqueeze(0).cuda()
model.eval()
sample_outputs = model.generate(prompt_encoded, 
                                do_sample=True,   
                                max_length = 300,
                                num_return_sequences=1)
decoded_prediction = tokenizer.decode(sample_outputs[0], skip_special_tokens=True)
print(decoded_prediction)

Prompt: Yeong-sil, my sign is virgo and today is 20230327. What is the weather today? 
Here is your fortune:  Virgo is one of the most powerful forces on the planet, and you could be tempted to get into a big fight. Put the weapons away and bring out the olive branch. Take that energy that has built up and use it to fuel your romantic affairs instead of warlike ventures. Defuse the situation by sharing passionate nights with the one you love.


In [16]:
sign = "horsey"
date = 20230327
###edit the above to get different fortunes###

prompt = "Prompt: Yeong-sil, my sign is " + sign + " and today is " + str(date) + ". What is the weather today? \nHere is your fortune: "
# prompt_encoded = torch.tensor(tokenizer.encode(prompt)).cuda()
prompt_encoded = torch.tensor(tokenizer.encode(prompt)).unsqueeze(0).cuda()
model.eval()
sample_outputs = model.generate(prompt_encoded, 
                                do_sample=True,   
                                max_length = 300,
                                min_length=100, #optional to add in if your model is having some trouble generating out the full fortune
                                num_return_sequences=1)
decoded_prediction = tokenizer.decode(sample_outputs[0], skip_special_tokens=True)
print(decoded_prediction)

Prompt: Yeong-sil, my sign is horsey and today is 20230327. What is the weather today? 
Here is your fortune: ~~~
Tonight is the best time to be a horsey. Your romantic life is one area where you might do better taking the opposite approach. Have confidence and be spontaneous in all matters having to do with love. The key now is to make sure that you aren't giving yourself away to someone who's unworthy of your love. Match yourself with a person who appreciates you for the amazing person you re.
