# Fine-Tuning

Fine-tuning means training a pre-trained model further: its weights (parameters) are updated so that it performs better on your dataset.

If you're familiar with training ANNs in TensorFlow, this is as simple as continuing to train the model on the new dataset.
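
To make "continuing to train" concrete, here is a minimal toy sketch in plain Python rather than TensorFlow. It assumes a one-feature linear model and made-up data; the names `train`, `mse`, `pretrain_data`, and `finetune_data` are illustrative only and are not part of this notebook's pipeline. The point is just that fine-tuning starts from the already-learned weights instead of from scratch.

```python
def train(w, b, data, lr=0.05, epochs=200):
    """Plain gradient descent on mean squared error for y ~ w * x + b."""
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def mse(w, b, data):
    """Mean squared error of the model on a dataset."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


# "Pre-training": a larger, generic dataset that follows y = 2x (made up).
pretrain_data = [(x / 10, 2 * x / 10) for x in range(-20, 21)]
w, b = train(0.0, 0.0, pretrain_data)

# "Fine-tuning": a small dataset from a slightly different task, y = 2x + 1.
finetune_data = [(-1.0, -1.0), (0.0, 1.0), (1.0, 3.0)]
loss_before = mse(w, b, finetune_data)

# Start from the pre-trained weights and keep training on the new data.
w_ft, b_ft = train(w, b, finetune_data, epochs=50)
loss_after = mse(w_ft, b_ft, finetune_data)

print(f"loss before fine-tuning: {loss_before:.4f}")
print(f"loss after fine-tuning:  {loss_after:.4f}")
```

The slope learned during pre-training carries over, so the short second run mostly has to learn the new offset.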

I wanted to use the dataset and follow the example from the Hugging Face course, but I could not get any dataset to load. I could have used `requests` to work around this, but I'm going to work with the Sephora makeup reviews dataset from Kaggle instead.

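The `requests` workaround would be something like:
```
import requests

r = requests.get("uri_to_dataset")  # placeholder: substitute the dataset's URL
r.raise_for_status()  # fail early on an HTTP error
data = r.json()  # assumes the endpoint returns JSON
```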

Also, I do not have enough RAM to hold the whole dataset in memory, so I process it through file I/O.

In [19]:
'''
Condense the five files containing reviews into one.
'''
import os

FILENAME = 'reviews_complete.csv'
PATH = os.path.join('.', 'Sephora Reviews')
files = sorted(os.listdir(PATH))  # sort so the merge order is deterministic

if not os.path.exists(FILENAME):
    with open(FILENAME, 'wb') as f:
        for i, file in enumerate(files):
            with open(os.path.join(PATH, file), 'rb') as f2:
                if i:
                    f2.readline()  # skip the header row of every file after the first
                f.write(f2.read())
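
The cell above still reads each source file fully into memory with `f2.read()`. Since RAM is the constraint here, a streaming variant may be worth considering. The sketch below is one possible approach, not what this notebook actually ran: `merge_csv_files` is a hypothetical helper, and it assumes every file ends with a newline and shares the same one-line header.

```python
import shutil


def merge_csv_files(paths, out_path):
    """Concatenate CSV files, keeping the header line of the first file only."""
    with open(out_path, "wb") as out:
        for index, path in enumerate(paths):
            with open(path, "rb") as src:
                if index > 0:
                    src.readline()  # drop the repeated header line
                # copyfileobj copies in fixed-size chunks, so memory use
                # stays small regardless of the file size
                shutil.copyfileobj(src, out)
```

With the variables from the cell above, a call could look like `merge_csv_files([os.path.join(PATH, name) for name in files], FILENAME)`.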

In [77]:
'''
Pull 10000 records at a time.
Transform them.
Place them in a new file.
'''
import os
import pandas as pd

# Variables/Constants
FILENAME = 'sephora_reviews_preprocessed.csv'
NROWS = 10000

# Preprocessing; `tokenizer` comes from the tokenizer cell (In [21]), which must run first
if not os.path.exists(FILENAME):
    with open(FILENAME, 'w') as f:
        # chunksize streams the file NROWS rows at a time instead of loading it whole
        for dataset in pd.read_csv('reviews_complete.csv', chunksize = NROWS):
            # Fill missing titles/texts first: NaN + str is NaN, which would blank the whole review
            title = dataset.review_title.fillna('')
            text = dataset.review_text.fillna('')
            dataset['full_review'] = "Review title: " + title + "; Review text: " + text

            # padding = True pads only to the longest review in this chunk,
            # so rows written from different chunks can differ in length
            f.writelines(
                [
                  ','.join( [str(token) for token in tokens] ) + "\n"
                  for tokens in tokenizer(list(dataset.full_review), padding = True,).input_ids
                ]
            )
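
Because `padding = True` pads each batch only to its own longest review, rows written from different batches can end up with different lengths. The sketch below shows, in plain Python, one way to force a fixed length before writing and to read a row back. Everything in it is illustrative: `pad_or_truncate`, `to_csv_line`, and `from_csv_line` are hypothetical helpers, the token IDs are toy values, and `PAD_ID = 1` is an assumption about this tokenizer that should be checked against `tokenizer.pad_token_id`.

```python
PAD_ID = 1  # assumption: verify against tokenizer.pad_token_id


def pad_or_truncate(ids, max_len, pad_id=PAD_ID):
    """Return exactly max_len token IDs: cut long sequences, right-pad short ones.

    Naive truncation: a cut sequence loses its trailing end-of-sequence token.
    """
    return ids[:max_len] + [pad_id] * max(0, max_len - len(ids))


def to_csv_line(ids):
    """Serialise token IDs the same way the cell above does."""
    return ",".join(str(token) for token in ids) + "\n"


def from_csv_line(line):
    """Parse one line of the preprocessed file back into integer token IDs."""
    return [int(token) for token in line.strip().split(",")]


# Two toy "batches" whose longest sequences differ in length.
batch_a = [[0, 32773, 1270, 2], [0, 38, 2]]
batch_b = [[0, 5, 146, 12, 658, 2]]

lines = [to_csv_line(pad_or_truncate(ids, 5)) for ids in batch_a + batch_b]
rows = [from_csv_line(line) for line in lines]
print(rows)
```

In practice the tokenizer can do this itself: calling it with `padding = 'max_length'`, `truncation = True`, and a `max_length` value should give rows of equal length across batches while keeping the special tokens intact.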

In [74]:
dataset.shape

(5, 20)

In [21]:
'''
Load the tokenizer and the pre-trained model
'''

import tensorflow as tf
import numpy as np
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

checkpoint = "cardiffnlp/twitter-roberta-base-sentiment-latest"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)




All model checkpoint layers were used when initializing TFRobertaForSequenceClassification.

Some layers of TFRobertaForSequenceClassification were not initialized from the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
tokenizer(list(dataset.full_review), padding = True)

In [48]:
list(map(str, tokenizer(list(dataset.full_review)).input_ids[0]))

['0', '32773', '1270', '35', '255', '19906', '162', '141', '7', '1457',
 '2382', '1090', '328', '131', '5872', '2788', '35', '38', '304', '42',
 '19', '5', '234', '1906', '990', '3181', '44', '48', '347', '405',
 '14888', '10326', '4317', '119', '359', '5293', '12', '10926', '30371', '17',
 '48', '7', '1457', '2382', '1090', '8', '24', '34', '2198', '1714',
 '127', '3024', '36', '1990', '5', '357', '322', '20', '146', '12',
 '658', '20147', '16', '681', '716', '8', '24508', '70', '9', '110',
 '7855', '2422', '2773', '4', '38', '1407', '12', '658', '19', '42',
 '514', '716', '30317', '254', '6', '8', '38', '67', '304', '42',
 '95', '30', '1495', '77', '38', '17', '27', '119', '45', '2498',
 '146', '12', '658', '4', '85', '3607', '5', '3024', '18399', '30317',
 '196', '6', '53', '396', '28546', '5', '3024', '4', '158', '73',
 '698', '5940', '15224', '19', '5', '146', '12',
 ...

In [61]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1301136 entries, 0 to 1301135
Data columns (total 19 columns):
 #   Column                    Non-Null Count    Dtype  
---  ------                    --------------    -----  
 0   Unnamed: 0                1301136 non-null  int64  
 1   author_id                 1301136 non-null  object 
 2   rating                    1301136 non-null  int64  
 3   is_recommended            1107162 non-null  float64
 4   helpfulness               631670 non-null   float64
 5   total_feedback_count      1301136 non-null  int64  
 6   total_neg_feedback_count  1301136 non-null  int64  
 7   total_pos_feedback_count  1301136 non-null  int64  
 8   submission_time           1301136 non-null  object 
 9   review_text               1299520 non-null  object 
 10  review_title              930754 non-null   object 
 11  skin_tone                 1103798 non-null  object 
 12  eye_color                 1057734 non-null  object 
 13  skin_type                 1