# TF-IDF Recommendation on the Players Mock Dataset for Users

This notebook applies Term Frequency-Inverse Document Frequency (TF-IDF) to a mock dataset of players, with the aim of producing content-based recommendations.
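
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF. It assumes the smoothed IDF and L2 row normalization that scikit-learn's `TfidfVectorizer` uses by default, and it skips stop-word removal. The helper names `tokenize` and `tfidf` and the three sample strings are illustrative only and are not part of the notebook.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    # (the same idea as scikit-learn's default token pattern).
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}

    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        weights = {t: tf[t] * idf[t] for t in tf}
        # L2-normalize so every non-empty row has unit length
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows


docs = [
    "Luka Modrić Real Madrid Croatie",
    "Sergio Ramos Real Madrid Espagne",
    "Cristiano Ronaldo Al-Nassr Portugal",
]
rows = tfidf(docs)
print(rows[2])
```

In this toy corpus the third string splits into five terms that each occur in one document only, so they share the same IDF and each ends up with weight 1/√5 ≈ 0.447 after normalization. Terms shared between documents, such as `real` and `madrid`, receive a lower weight than terms unique to one document.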

## Imports

In [15]:
import pandas as pd
from recommenders.models.tfidf.tfidf_utils import TfidfRecommender
from sklearn.feature_extraction.text import TfidfVectorizer

## 1. Load dataset into dataframe

In [16]:
match_df = pd.read_csv('datasets/player_data.csv')

print(match_df.head())
print('\nNumber of entries in dataset: ' + str(len(match_df)))


                                   UUID         PlayerName         PlayerTeam  \
0  b906eb96-4c26-49a9-bc8f-c7989ae5cc67        Luka Modrić        Real Madrid   
1  539c9764-bf4f-4a2f-8b62-1489893be273       Sergio Ramos        Real Madrid   
2  ecb389db-5a98-4b3d-8662-0d8b7d5c196e         Paul Pogba  Manchester United   
3  69c0383f-66ca-4d81-80e3-b3fa0dcd93e0  Cristiano Ronaldo           Al-Nassr   
4  24de84b4-b7f6-4cf9-a8bc-e74d40b50f6e  Antoine Griezmann    Atlético Madrid   

  PlayerCountry  userId userRiskType UserVote  BetPlaced  BetAmount BetOutcome  
0       Croatie      15          Low  Dislike       True      78.66    Pending  
1       Espagne      10         High     Like      False       0.00        NaN  
2        France      17       Medium     Like      False       0.00        NaN  
3      Portugal       1         High  Neutral       True      39.12       Lose  
4        France       2       Medium  Dislike      False       0.00        NaN  

Number of entries in datas

## 2. Instantiate the recommender

Select one of the following tokenization methods for the model:

| tokenization_method | Description                                                                                                                      |
|:--------------------|:---------------------------------------------------------------------------------------------------------------------------------|
| 'none'              | No tokenization is applied. Each word is considered a token.                                                                     |
| 'nltk'              | Simple stemming is applied using NLTK.                                                                                           |
| 'bert'              | HuggingFace BERT word tokenization ('bert-base-cased') is applied.                                                               |
| 'scibert'           | SciBERT word tokenization ('allenai/scibert_scivocab_cased') is applied.<br>This is recommended for scientific journal articles. |

Casing (capitalization) is preserved for [BERT-based tokenization methods](https://huggingface.co/transformers/model_doc/bert.html), but is removed for simple or no tokenization.
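
The effect of casing on the vocabulary can be sketched in a few lines of plain Python. The `split_words` function below is a hypothetical stand-in, not the tokenizer used by `TfidfRecommender`. It only shows that preserving case keeps `Real` and `real` as separate terms, while lowercasing merges them.

```python
import re


def split_words(text, lowercase):
    # Hypothetical stand-in tokenizer: optional lowercasing, then word splitting.
    if lowercase:
        text = text.lower()
    return re.findall(r"\w+", text)


samples = ["Real Madrid", "the real deal"]

# Vocabulary when casing is preserved: "Real" and "real" are distinct terms
cased_vocab = {tok for s in samples for tok in split_words(s, lowercase=False)}

# Vocabulary when casing is removed: both collapse to "real"
uncased_vocab = {tok for s in samples for tok in split_words(s, lowercase=True)}

print(sorted(cased_vocab))
print(sorted(uncased_vocab))
```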

In [17]:
recommender = TfidfRecommender(id_col='UUID', tokenization_method='scibert')

## 3. Prepare text for use

In [None]:
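# Combine the text columns into a single string per row
match_df['combined'] = match_df['PlayerName'] + ' ' + match_df['PlayerTeam'] + ' ' + match_df['PlayerCountry']

# Alternative (disabled): build the TF-IDF matrix directly with scikit-learn.
# stop_words='english' drops common English words such as "the" and "and".
#vectorizer = TfidfVectorizer(stop_words='english')

# Fit the vectorizer on the combined text and compute the TF-IDF matrix
#tfidf_matrix = vectorizer.fit_transform(match_df['combined'])

# Convert the TF-IDF matrix to a DataFrame for better readability
#tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

#print(tfidf_df)

# Clean the combined text with the recommender.
# clean_dataframe requires the list of columns to clean; the result is stored in a new column.
df_clean = recommender.clean_dataframe(match_df, cols_to_clean=['combined'], new_col_name='cleaned_text')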

           al  angleterre   antoine  argentina  atlético       bas   benzema  \
0    0.000000    0.000000  0.000000        0.0  0.000000  0.000000  0.000000   
1    0.000000    0.000000  0.000000        0.0  0.000000  0.000000  0.000000   
2    0.000000    0.000000  0.000000        0.0  0.000000  0.000000  0.000000   
3    0.447214    0.000000  0.000000        0.0  0.000000  0.000000  0.000000   
4    0.000000    0.000000  0.513927        0.0  0.513927  0.000000  0.000000   
..        ...         ...       ...        ...       ...       ...       ...   
195  0.000000    0.000000  0.000000        0.0  0.000000  0.000000  0.559823   
196  0.000000    0.000000  0.513927        0.0  0.513927  0.000000  0.000000   
197  0.000000    0.447214  0.000000        0.0  0.000000  0.000000  0.000000   
198  0.000000    0.000000  0.000000        0.0  0.000000  0.449709  0.000000   
199  0.447214    0.000000  0.000000        0.0  0.000000  0.000000  0.000000   

       bruyne  brésil      city  ...  p