We will now try a pretrained zero-shot classification model from HuggingFace (`facebook/bart-large-mnli`) to classify the locations by type. Before deciding whether to use it, we evaluate the model by comparing its predictions against a sample of hand-labelled location types.

## Contents:
- [Instantiate Classifier Pipeline from HuggingFace](#Instantiate-Classifier-Pipeline-from-HuggingFace)
- [Loading of Libraries](#Loading-of-Libraries) 
- [Loading of Dataset](#Loading-of-Dataset)
- [Zero Shot Pretrained Model Evaluation](#Zero-Shot-Pretrained-Model-Evaluation)

## Instantiate Classifier Pipeline from HuggingFace

In [6]:
# Instantiate zero shot classifier pipeline based on documentation on HuggingFace
from transformers import pipeline
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")



## Loading of Libraries

In [15]:
# Import libraries
import pandas as pd
import numpy as np
from tqdm import tqdm
tqdm.pandas()
from sklearn.metrics import accuracy_score

## Loading of Dataset

In [2]:
# Load data
recsys_df = pd.read_csv('./output/recsys_df_name.csv')
print(recsys_df.shape)
recsys_df.head(3)

(177607, 7)


Unnamed: 0,profile_id,location_id,cts,sentiment_pred,name,city,cd
0,4519805.0,1178180.0,2016-06-09 22:13:32,3,la famiglia,"London, United Kingdom",GB
1,259484700.0,1178180.0,2019-05-30 23:17:22,3,la famiglia,"London, United Kingdom",GB
2,6364797000.0,1178180.0,2019-05-26 15:27:27,3,la famiglia,"London, United Kingdom",GB


## Zero Shot Pretrained Model Evaluation

#### Using 100 rows for comparison

In [7]:
# Save and export first 100 rows for zero shot comparison
recsys_df_100 = recsys_df.head(100)
recsys_df_100.to_csv('./output/recsys_df_100.csv', index=False)

In [8]:
# Function for zero shot prediction
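def zero_shot_preds(df, classifier, topic_list):
    # Classify each location name; keep the outputs in a local Series so the
    # input DataFrame is not modified
    outputs = df["name"].progress_apply(classifier, candidate_labels=topic_list)
    # Expand the output dicts into 'sequence', 'labels' and 'scores' columns
    zero_shot_df = pd.json_normalize(outputs.tolist())
    # Labels are sorted by descending score, so the first one is the prediction
    zero_shot_df["zero_shot_result"] = zero_shot_df["labels"].str[0]

    return zero_shot_df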
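The function depends on the shape of what the pipeline returns for a single input: a dictionary with `sequence`, `labels` and `scores`, where the labels are ordered from highest to lowest score, so `labels[0]` is the predicted class. The sketch below mimics that shape with a hypothetical stand-in (`fake_classifier`, whose scores are made up) so that it runs without downloading the model. It illustrates the data structure only and is not the real pipeline.

```python
def fake_classifier(text, candidate_labels):
    """Hypothetical stand-in for the zero-shot pipeline (scores are made up).

    Gives each label a base score of 1, plus 1 for every word of the text
    that occurs inside the label, then normalises and sorts best-first.
    """
    words = text.lower().split()
    raw = [1 + sum(1 for w in words if w in label) for label in candidate_labels]
    total = sum(raw)
    ranked = sorted(zip(candidate_labels, raw), key=lambda pair: pair[1], reverse=True)
    return {
        "sequence": text,
        "labels": [label for label, _ in ranked],
        "scores": [score / total for _, score in ranked],
    }


def top_label(output):
    """Return the highest-scoring label from one classifier output."""
    return output["labels"][0]


result = fake_classifier("green park", ["museums", "nature_and_parks", "others"])
print(top_label(result))
```

In `zero_shot_preds`, `zero_shot_df['labels'].str[0]` performs the same first-element lookup for every row at once.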

In [9]:
%%time
zero_shot_df_100 = zero_shot_preds(
    recsys_df_100,
    classifier,
    [
        'boat_tours_and_water_sports', 'pubs_and_nightlife', 'sights_and_landmarks',
        'spas_and_wellness', 'fun_and_games', 'museums', 'classes_and_workshops',
        'nature_and_parks', 'markets', 'neighbourhoods', 'others',
    ],
)

100%|█████████████████████████████████████████| 100/100 [06:47<00:00,  4.08s/it]

CPU times: user 12min 12s, sys: 11 s, total: 12min 23s
Wall time: 6min 47s
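At about 4 seconds per row on CPU, scoring every row individually would be slow on the full dataset of 177,607 rows. The sketch below gives a rough extrapolation and one possible mitigation: classify each distinct name once and reuse the result for repeated rows. The helper `classify_unique` and the stand-in `slow_classifier` are hypothetical names, and the estimate assumes the per-row time stays constant.

```python
SECONDS_PER_ROW = 4.08   # from the progress bar above
TOTAL_ROWS = 177_607     # from recsys_df.shape

estimated_hours = SECONDS_PER_ROW * TOTAL_ROWS / 3600
print(f"Estimated time for all rows: {estimated_hours:.0f} hours")


def classify_unique(names, classify):
    """Call `classify` once per distinct name and reuse the result for repeats."""
    cache = {}
    results = []
    for name in names:
        if name not in cache:
            cache[name] = classify(name)
        results.append(cache[name])
    return results


calls = []


def slow_classifier(name):
    """Hypothetical stand-in that records how often it is called."""
    calls.append(name)
    return "others"


names = ["la famiglia", "la famiglia", "la famiglia", "green park", "green park"]
predictions = classify_unique(names, slow_classifier)
print(len(predictions), "predictions from", len(calls), "classifier calls")
```

Because the same location name appears on several consecutive rows in this dataset, caching by name could cut the number of model calls considerably, although the exact saving depends on how many distinct names there are.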





In [10]:
zero_shot_df_100

Unnamed: 0,sequence,labels,scores,zero_shot_result
0,la famiglia,"[others, sights_and_landmarks, fun_and_games, ...","[0.22058841586112976, 0.16727125644683838, 0.0...",others
1,la famiglia,"[others, sights_and_landmarks, fun_and_games, ...","[0.22058841586112976, 0.16727125644683838, 0.0...",others
2,la famiglia,"[others, sights_and_landmarks, fun_and_games, ...","[0.22058841586112976, 0.16727125644683838, 0.0...",others
3,green park,"[nature_and_parks, sights_and_landmarks, fun_a...","[0.506106436252594, 0.14772608876228333, 0.096...",nature_and_parks
4,green park,"[nature_and_parks, sights_and_landmarks, fun_a...","[0.506106436252594, 0.14772608876228333, 0.096...",nature_and_parks
...,...,...,...,...
95,gossip.com,"[others, neighbourhoods, markets, sights_and_l...","[0.4013848900794983, 0.11936227232217789, 0.08...",others
96,gossip.com,"[others, neighbourhoods, markets, sights_and_l...","[0.4013848900794983, 0.11936227232217789, 0.08...",others
97,gossip.com,"[others, neighbourhoods, markets, sights_and_l...","[0.4013848900794983, 0.11936227232217789, 0.08...",others
98,gossip.com,"[others, neighbourhoods, markets, sights_and_l...","[0.4013848900794983, 0.11936227232217789, 0.08...",others


In [11]:
# Load true (hand labelled) location type
recsys_df_100_true = pd.read_csv('./output/recsys_df_100_true.csv')
print(recsys_df_100_true.shape)
recsys_df_100_true.head()

(100, 8)


Unnamed: 0,profile_id,location_id,cts,sentiment_pred,name,city,cd,type_true
0,4519805,1178180,2016-06-09 22:13:32,3,la famiglia,"London, United Kingdom",GB,others
1,259484657,1178180,2019-05-30 23:17:22,3,la famiglia,"London, United Kingdom",GB,others
2,6364796956,1178180,2019-05-26 15:27:27,3,la famiglia,"London, United Kingdom",GB,others
3,221389356,857670431,2019-05-30 21:41:15,2,green park,"London, United Kingdom",GB,nature_and_parks
4,624306558,857670431,2019-05-30 7:56:50,3,green park,"London, United Kingdom",GB,nature_and_parks


In [18]:
# Join both datasets
recsys_df_100 = recsys_df_100_true.join(zero_shot_df_100)
recsys_df_100.head()

Unnamed: 0,profile_id,location_id,cts,sentiment_pred,name,city,cd,type_true,sequence,labels,scores,zero_shot_result
0,4519805,1178180,2016-06-09 22:13:32,3,la famiglia,"London, United Kingdom",GB,others,la famiglia,"[others, sights_and_landmarks, fun_and_games, ...","[0.22058841586112976, 0.16727125644683838, 0.0...",others
1,259484657,1178180,2019-05-30 23:17:22,3,la famiglia,"London, United Kingdom",GB,others,la famiglia,"[others, sights_and_landmarks, fun_and_games, ...","[0.22058841586112976, 0.16727125644683838, 0.0...",others
2,6364796956,1178180,2019-05-26 15:27:27,3,la famiglia,"London, United Kingdom",GB,others,la famiglia,"[others, sights_and_landmarks, fun_and_games, ...","[0.22058841586112976, 0.16727125644683838, 0.0...",others
3,221389356,857670431,2019-05-30 21:41:15,2,green park,"London, United Kingdom",GB,nature_and_parks,green park,"[nature_and_parks, sights_and_landmarks, fun_a...","[0.506106436252594, 0.14772608876228333, 0.096...",nature_and_parks
4,624306558,857670431,2019-05-30 7:56:50,3,green park,"London, United Kingdom",GB,nature_and_parks,green park,"[nature_and_parks, sights_and_landmarks, fun_a...","[0.506106436252594, 0.14772608876228333, 0.096...",nature_and_parks


In [13]:
# Drop 'sequence','labels','scores' cols
recsys_df_100.drop(['sequence','labels','scores'], axis=1, inplace=True)

In [14]:
recsys_df_100.head()

Unnamed: 0,profile_id,location_id,cts,sentiment_pred,name,city,cd,type_true,zero_shot_result
0,4519805,1178180,2016-06-09 22:13:32,3,la famiglia,"London, United Kingdom",GB,others,others
1,259484657,1178180,2019-05-30 23:17:22,3,la famiglia,"London, United Kingdom",GB,others,others
2,6364796956,1178180,2019-05-26 15:27:27,3,la famiglia,"London, United Kingdom",GB,others,others
3,221389356,857670431,2019-05-30 21:41:15,2,green park,"London, United Kingdom",GB,nature_and_parks,nature_and_parks
4,624306558,857670431,2019-05-30 7:56:50,3,green park,"London, United Kingdom",GB,nature_and_parks,nature_and_parks


In [16]:
# Check accuracy of pretrained model
accuracy_score(recsys_df_100['type_true'].values, recsys_df_100['zero_shot_result'].values)

0.3

- At 0.3, the accuracy of the pretrained model on the first 100 rows is very poor. We will repeat the comparison on the first 200 rows.
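For reference, the accuracy used here is simply the share of rows whose predicted label exactly matches the hand label. A single figure hides which categories fail, so a per-label breakdown can help when judging the model. The sketch below uses plain Python and a handful of illustrative labels (invented for the example, not taken from the hand-labelled file).

```python
from collections import Counter

# Illustrative labels only, not taken from the hand-labelled file
y_true = ["others", "others", "nature_and_parks", "others", "museums"]
y_pred = ["others", "spas_and_wellness", "nature_and_parks", "markets", "museums"]

# Overall accuracy: fraction of exact matches
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)

# Per-label accuracy: correct predictions divided by rows with that true label
totals = Counter(y_true)
correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
per_label = {label: correct[label] / totals[label] for label in totals}
print(per_label)
```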

#### Using 200 rows for comparison

In [19]:
# Save and export first 200 rows for zero shot comparison
recsys_df_200 = recsys_df.head(200)
recsys_df_200.to_csv('./output/recsys_df_200.csv', index=False)

In [21]:
%%time
zero_shot_df_200 = zero_shot_preds(
    recsys_df_200,
    classifier,
    [
        'boat_tours_and_water_sports', 'pubs_and_nightlife', 'sights_and_landmarks',
        'spas_and_wellness', 'fun_and_games', 'museums', 'classes_and_workshops',
        'nature_and_parks', 'markets', 'neighbourhoods', 'others',
    ],
)

100%|█████████████████████████████████████████| 200/200 [16:34<00:00,  4.97s/it]

CPU times: user 26min 59s, sys: 31.2 s, total: 27min 30s
Wall time: 16min 34s





In [22]:
zero_shot_df_200.head()

Unnamed: 0,sequence,labels,scores,zero_shot_result
0,la famiglia,"[others, sights_and_landmarks, fun_and_games, ...","[0.22058841586112976, 0.16727125644683838, 0.0...",others
1,la famiglia,"[others, sights_and_landmarks, fun_and_games, ...","[0.22058841586112976, 0.16727125644683838, 0.0...",others
2,la famiglia,"[others, sights_and_landmarks, fun_and_games, ...","[0.22058841586112976, 0.16727125644683838, 0.0...",others
3,green park,"[nature_and_parks, sights_and_landmarks, fun_a...","[0.506106436252594, 0.14772608876228333, 0.096...",nature_and_parks
4,green park,"[nature_and_parks, sights_and_landmarks, fun_a...","[0.506106436252594, 0.14772608876228333, 0.096...",nature_and_parks


In [24]:
# Load true (hand labelled) location type
recsys_df_200_true = pd.read_csv('./output/recsys_df_200_true.csv')
print(recsys_df_200_true.shape)
recsys_df_200_true.head()

(200, 8)


Unnamed: 0,profile_id,location_id,cts,sentiment_pred,name,city,cd,type_true
0,4519805,1178180.0,2016-06-09 22:13:32,3,la famiglia,"London, United Kingdom",GB,others
1,259484657,1178180.0,2019-05-30 23:17:22,3,la famiglia,"London, United Kingdom",GB,others
2,6364796956,1178180.0,2019-05-26 15:27:27,3,la famiglia,"London, United Kingdom",GB,others
3,221389356,857670431.0,2019-05-30 21:41:15,2,green park,"London, United Kingdom",GB,nature_and_parks
4,624306558,857670431.0,2019-05-30 7:56:50,3,green park,"London, United Kingdom",GB,nature_and_parks


In [28]:
# Join both datasets
recsys_df_200 = recsys_df_200_true.join(zero_shot_df_200)
recsys_df_200.head()

Unnamed: 0,profile_id,location_id,cts,sentiment_pred,name,city,cd,type_true,sequence,labels,scores,zero_shot_result
0,4519805,1178180.0,2016-06-09 22:13:32,3,la famiglia,"London, United Kingdom",GB,others,la famiglia,"[others, sights_and_landmarks, fun_and_games, ...","[0.22058841586112976, 0.16727125644683838, 0.0...",others
1,259484657,1178180.0,2019-05-30 23:17:22,3,la famiglia,"London, United Kingdom",GB,others,la famiglia,"[others, sights_and_landmarks, fun_and_games, ...","[0.22058841586112976, 0.16727125644683838, 0.0...",others
2,6364796956,1178180.0,2019-05-26 15:27:27,3,la famiglia,"London, United Kingdom",GB,others,la famiglia,"[others, sights_and_landmarks, fun_and_games, ...","[0.22058841586112976, 0.16727125644683838, 0.0...",others
3,221389356,857670431.0,2019-05-30 21:41:15,2,green park,"London, United Kingdom",GB,nature_and_parks,green park,"[nature_and_parks, sights_and_landmarks, fun_a...","[0.506106436252594, 0.14772608876228333, 0.096...",nature_and_parks
4,624306558,857670431.0,2019-05-30 7:56:50,3,green park,"London, United Kingdom",GB,nature_and_parks,green park,"[nature_and_parks, sights_and_landmarks, fun_a...","[0.506106436252594, 0.14772608876228333, 0.096...",nature_and_parks


In [29]:
# Save and export for zero shot comparison tuning
recsys_df_200.to_csv('./output/recsys_df_200_update.csv', index=False)

In [40]:
# Drop 'sequence','labels','scores' cols
recsys_df_200.drop(['sequence','labels','scores'], axis=1, inplace=True)
recsys_df_200.head()

Unnamed: 0,profile_id,location_id,cts,sentiment_pred,name,city,cd,type_true,zero_shot_result
0,4519805,1178180.0,2016-06-09 22:13:32,3,la famiglia,"London, United Kingdom",GB,others,others
1,259484657,1178180.0,2019-05-30 23:17:22,3,la famiglia,"London, United Kingdom",GB,others,others
2,6364796956,1178180.0,2019-05-26 15:27:27,3,la famiglia,"London, United Kingdom",GB,others,others
3,221389356,857670431.0,2019-05-30 21:41:15,2,green park,"London, United Kingdom",GB,nature_and_parks,nature_and_parks
4,624306558,857670431.0,2019-05-30 7:56:50,3,green park,"London, United Kingdom",GB,nature_and_parks,nature_and_parks


In [27]:
# Check accuracy of pretrained model
accuracy_score(recsys_df_200['type_true'].values, recsys_df_200['zero_shot_result'].values)

0.395

- The accuracy of the pretrained model improves to 0.395 when evaluated on the first 200 rows. We will hand-label a further 100 rows to gauge its performance more reliably.

#### Using 300 rows for comparison

In [31]:
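# Save and export next 100 rows for zero shot comparison
# Note: iloc[201:301] selects positions 201-300, so position 200 is covered
# neither by this slice nor by head(200), which ends at position 199
recsys_df_201to300 = recsys_df.iloc[201:301]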
recsys_df_201to300.to_csv('./output/recsys_df_201to300.csv', index=False)

In [32]:
%%time
zero_shot_df_201to300 = zero_shot_preds(
    recsys_df_201to300,
    classifier,
    [
        'boat_tours_and_water_sports', 'pubs_and_nightlife', 'sights_and_landmarks',
        'spas_and_wellness', 'fun_and_games', 'museums', 'classes_and_workshops',
        'nature_and_parks', 'markets', 'neighbourhoods', 'others',
    ],
)

100%|█████████████████████████████████████████| 100/100 [09:41<00:00,  5.82s/it]

CPU times: user 14min 6s, sys: 22.7 s, total: 14min 29s
Wall time: 9min 41s





In [33]:
zero_shot_df_201to300.head()

Unnamed: 0,sequence,labels,scores,zero_shot_result
0,souli food,"[others, sights_and_landmarks, spas_and_wellne...","[0.23861777782440186, 0.13402360677719116, 0.1...",others
1,souli food,"[others, sights_and_landmarks, spas_and_wellne...","[0.23861777782440186, 0.13402360677719116, 0.1...",others
2,tesco,"[spas_and_wellness, markets, sights_and_landma...","[0.23894090950489044, 0.13002963364124298, 0.0...",spas_and_wellness
3,neal's yard remedies,"[spas_and_wellness, nature_and_parks, sights_a...","[0.2223985493183136, 0.1895524263381958, 0.178...",spas_and_wellness
4,neal's yard remedies,"[spas_and_wellness, nature_and_parks, sights_a...","[0.2223985493183136, 0.1895524263381958, 0.178...",spas_and_wellness


In [35]:
# Load true (hand labelled) location type
recsys_df_201to300_true = pd.read_csv('./output/recsys_df_201to300_true.csv')
print(recsys_df_201to300_true.shape)
recsys_df_201to300_true.head()

(100, 8)


Unnamed: 0,profile_id,location_id,cts,sentiment_pred,name,city,cd,type_true
0,2240117505,231412800.0,2019-01-31 13:57:31,3,souli food,"London, United Kingdom",GB,others
1,537500464,231412800.0,2018-10-30 7:49:01,3,souli food,"London, United Kingdom",GB,others
2,8262374553,1022884000.0,2018-07-30 15:39:19,3,tesco,"London, United Kingdom",GB,others
3,400303321,1019567000.0,2017-06-22 14:21:12,3,neal's yard remedies,"London, United Kingdom",GB,others
4,397556091,1019567000.0,2018-01-07 21:55:06,2,neal's yard remedies,"London, United Kingdom",GB,others


In [36]:
# Join both datasets
recsys_df_201to300 = recsys_df_201to300_true.join(zero_shot_df_201to300)
recsys_df_201to300.head()

Unnamed: 0,profile_id,location_id,cts,sentiment_pred,name,city,cd,type_true,sequence,labels,scores,zero_shot_result
0,2240117505,231412800.0,2019-01-31 13:57:31,3,souli food,"London, United Kingdom",GB,others,souli food,"[others, sights_and_landmarks, spas_and_wellne...","[0.23861777782440186, 0.13402360677719116, 0.1...",others
1,537500464,231412800.0,2018-10-30 7:49:01,3,souli food,"London, United Kingdom",GB,others,souli food,"[others, sights_and_landmarks, spas_and_wellne...","[0.23861777782440186, 0.13402360677719116, 0.1...",others
2,8262374553,1022884000.0,2018-07-30 15:39:19,3,tesco,"London, United Kingdom",GB,others,tesco,"[spas_and_wellness, markets, sights_and_landma...","[0.23894090950489044, 0.13002963364124298, 0.0...",spas_and_wellness
3,400303321,1019567000.0,2017-06-22 14:21:12,3,neal's yard remedies,"London, United Kingdom",GB,others,neal's yard remedies,"[spas_and_wellness, nature_and_parks, sights_a...","[0.2223985493183136, 0.1895524263381958, 0.178...",spas_and_wellness
4,397556091,1019567000.0,2018-01-07 21:55:06,2,neal's yard remedies,"London, United Kingdom",GB,others,neal's yard remedies,"[spas_and_wellness, nature_and_parks, sights_a...","[0.2223985493183136, 0.1895524263381958, 0.178...",spas_and_wellness


In [37]:
# Drop 'sequence','labels','scores' cols
recsys_df_201to300.drop(['sequence','labels','scores'], axis=1, inplace=True)
recsys_df_201to300.head()

Unnamed: 0,profile_id,location_id,cts,sentiment_pred,name,city,cd,type_true,zero_shot_result
0,2240117505,231412800.0,2019-01-31 13:57:31,3,souli food,"London, United Kingdom",GB,others,others
1,537500464,231412800.0,2018-10-30 7:49:01,3,souli food,"London, United Kingdom",GB,others,others
2,8262374553,1022884000.0,2018-07-30 15:39:19,3,tesco,"London, United Kingdom",GB,others,spas_and_wellness
3,400303321,1019567000.0,2017-06-22 14:21:12,3,neal's yard remedies,"London, United Kingdom",GB,others,spas_and_wellness
4,397556091,1019567000.0,2018-01-07 21:55:06,2,neal's yard remedies,"London, United Kingdom",GB,others,spas_and_wellness


In [42]:
# Concatenate these 100 rows with the first 200 rows (reset the index so it is unique)
recsys_df_300 = pd.concat([recsys_df_200, recsys_df_201to300], ignore_index=True)

In [43]:
# Check accuracy of pretrained model
accuracy_score(recsys_df_300['type_true'].values, recsys_df_300['zero_shot_result'].values)

0.2866666666666667
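The three reported accuracies are cumulative, so they can be broken down into the accuracy of each batch of 100 rows. The sketch below does that arithmetic. It assumes the first 100 rows received the same predictions and hand labels in the 100-row and 200-row runs, which the notebook does not verify explicitly.

```python
# Reported accuracies and the number of rows each one covers
acc_100, acc_200, acc_300 = 0.3, 0.395, 0.2866666666666667

correct_100 = round(acc_100 * 100)   # correct rows among the first 100
correct_200 = round(acc_200 * 200)   # correct rows among the first 200
correct_300 = round(acc_300 * 300)   # correct rows among all 300

# Accuracy of each batch of 100 rows taken on its own
batch_1 = correct_100 / 100
batch_2 = (correct_200 - correct_100) / 100
batch_3 = (correct_300 - correct_200) / 100
print(batch_1, batch_2, batch_3)
```

Under that assumption the per-batch accuracies are 0.30, 0.49 and 0.07. The spread suggests the result depends heavily on which locations happen to fall in each batch rather than on the number of rows compared.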

- Over all 300 rows, the accuracy of the pretrained model drops to about 0.29, slightly below the 0.3 measured on the first 100 rows.
- The pretrained model is unable to reliably categorise the locations into the given categories from their names.
- As such, we will not use this pretrained model.
- We will proceed to build a collaborative-filtering recommender system without the location type feature.