# Scoring system with TF-IDF

## Summary of the approach in this notebook

**Text Preprocessing:** We apply several preprocessing steps to the text data to make it easier for our algorithms to work with. This includes transforming all text to lowercase, removing punctuation and numeric values, splitting the text into individual words (tokenization), removing common words (stopwords), and reducing words to their base or root form (lemmatization). The preprocessed data is then saved for further analysis.

**Vectorization:** The preprocessed text is transformed into a numerical format using TF-IDF (Term Frequency-Inverse Document Frequency). This results in a matrix where each row represents a senator's initiatives and each column represents a word. The value in each cell is the TF-IDF value of the word in the corresponding document.

**Similarity Calculation:** We then use this vectorized data to match a user's interests to the senators' initiatives. The user's input is preprocessed and vectorized in the same way as the senators' initiatives, and we calculate the cosine similarity between the user's vector and each senator's vector. Because TF-IDF weights are never negative, the cosine similarity yields a score between 0 (no terms in common) and 1 (identical term weighting) for each senator, indicating how closely their initiatives match the user's interests.

**Output:** Finally, we rank the senators based on their similarity scores, providing a list of the senators whose initiatives best match the user's input.
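
The similarity and ranking steps can be sketched on toy data as follows. The senator labels, documents, and query below are hypothetical and chosen only for illustration; the real notebook works on the preprocessed initiative summaries.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical preprocessed documents and senator labels, for illustration only.
toy_names = ["Senator A", "Senator B", "Senator C"]
toy_docs = [
    "reforma fiscal impuesto",
    "proteccion animal bienestar animal",
    "salud publico hospital",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# The query goes through the same fitted vectorizer as the documents.
query_vector = toy_vectorizer.transform(["proteccion animal"])

# cosine_similarity returns shape (1, n_documents): one score per senator.
scores = cosine_similarity(query_vector, toy_matrix)[0]

# Rank from most to least similar.
ranking = np.argsort(scores)[::-1]
for idx in ranking:
    print(f"{toy_names[idx]}: {scores[idx]:.3f}")
```

Only "Senator B" shares terms with the query, so it ranks first with a positive score while the other two score exactly 0.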

## Import Libraries

In [1]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
import re
import stanza
import string
import os
from sklearn.feature_extraction.text import TfidfVectorizer
import pickle
from sklearn.metrics.pairwise import cosine_similarity

## Initialize testing DF

This dataframe is used to test the functions in this notebook on a limited set of data.

In [2]:
SENATORS_TO_PROCESS = 3

current_path = os.getcwd()
parent_directory = os.path.dirname(current_path)
project_data_path = os.path.join(parent_directory, 'data')


senators_test_df = pd.read_csv(os.path.join(project_data_path, 'senators_data.csv')).head(SENATORS_TO_PROCESS)

## Stopwords

### Download the Spanish stopwords

In [3]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/luis/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Define Spanish stopwords

In [4]:
stop_words = set(stopwords.words('spanish'))

## Lemmatization

### Download Spanish tools

In [5]:
# Download Stanza's default models for processing Spanish text
stanza.download('es')

2023-07-12 10:37:49 INFO: Downloading default packages for language: es (Spanish) ...
2023-07-12 10:37:51 INFO: File exists: /Users/luis/stanza_resources/es/default.zip
2023-07-12 10:37:54 INFO: Finished downloading models and saved to /Users/luis/stanza_resources.


### Initialize Stanza's neural pipeline

In [6]:
# Stanza's pipeline bundles the processors (tokenizer, lemmatizer, etc.) for Spanish text
nlp = stanza.Pipeline('es')

2023-07-12 10:37:55 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


2023-07-12 10:37:56 INFO: Loading these models for language: es (Spanish):
| Processor    | Package  |
---------------------------
| tokenize     | ancora   |
| mwt          | ancora   |
| pos          | ancora   |
| lemma        | ancora   |
| constituency | combined |
| depparse     | ancora   |
| sentiment    | tass2020 |
| ner          | conll02  |

2023-07-12 10:37:56 INFO: Using device: cpu
2023-07-12 10:37:56 INFO: Loading: tokenize
2023-07-12 10:37:56 INFO: Loading: mwt
2023-07-12 10:37:56 INFO: Loading: pos
2023-07-12 10:37:56 INFO: Loading: lemma
2023-07-12 10:37:56 INFO: Loading: constituency
2023-07-12 10:37:57 INFO: Loading: depparse
2023-07-12 10:37:57 INFO: Loading: sentiment
2023-07-12 10:37:57 INFO: Loading: ner
2023-07-12 10:37:58 INFO: Done loading processors!


## Preprocessing

### Preprocess flow

In [7]:
def preprocess_text(text):
    # Lowercase the text
    text = text.lower()
    
    # Remove special characters
    text = re.sub(r'\[.*?\]', '', text)  # remove bracketed text, e.g. [this text is in brackets]
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)  # remove ASCII punctuation
    text = re.sub(r'\w*\d\w*', '', text)  # remove any token that contains a digit
    
    # Tokenization and filtering stop words
    text = nltk.word_tokenize(text)
    text = [word for word in text if word not in stop_words]
    
    # Lemmatization
    doc = nlp(' '.join(text))
    lemmas = []
    
    for sentence in doc.sentences:
        for word in sentence.words:
            lemmas.append(word.lemma)
            
    text = ' '.join(lemmas)
    
    return text
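
To see what the regex steps alone do, here is a minimal sketch that leaves out tokenization, stopword filtering, and lemmatization so it runs without NLTK or Stanza. The helper name `clean_text_sketch` and the sample sentence are hypothetical, and the final whitespace collapse is an addition for readability rather than part of `preprocess_text`.

```python
import re
import string


def clean_text_sketch(text):
    """Apply only the lowercasing and regex steps of the preprocessing flow."""
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)  # drop bracketed spans
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)  # drop ASCII punctuation
    text = re.sub(r'\w*\d\w*', '', text)  # drop any token that contains a digit
    return ' '.join(text.split())  # collapse leftover whitespace (added here for readability)


sample = "Iniciativa 2023 [nota interna]: Protección de los animales, artículo 4o."
print(clean_text_sketch(sample))
# prints: iniciativa protección de los animales artículo
```

Note that `string.punctuation` covers ASCII characters only, so Spanish marks such as "¿" and "¡" pass through these steps unchanged.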

### Run preprocess flow

In [8]:
# Apply the preprocessing to the 'initiatives_summary_dummy' column
senators_test_df['initiatives_summary_preprocessed'] = senators_test_df['initiatives_summary_dummy'].apply(preprocess_text)

# Save the preprocessed dataframe
senators_test_df.to_csv(os.path.join(project_data_path, 'senators_processed_data.csv'), index=False)

## Vectorize data

In [9]:
processed_df = pd.read_csv(os.path.join(project_data_path, 'senators_processed_data.csv'))

In [10]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(processed_df['initiatives_summary_preprocessed'])

### Save tfidf_matrix and vectorizer

Here's why we need to save both:

**TF-IDF Matrix:** This matrix is a numerical representation of our preprocessed text data. Each row corresponds to a senator and each column corresponds to a word. The value in each cell is the TF-IDF value of the word for the corresponding senator. By saving this matrix, we keep a record of how strongly each word is associated with each senator, based on that senator's initiatives. We'll use this matrix to compare the user's input with each senator's initiatives.

**Fitted Vectorizer:** This is the object that we used to convert our text data into the TF-IDF matrix. It has been 'fitted' to our text data, meaning it has learned the vocabulary of our text data. When we get a new piece of text (the user's input), we'll need to convert it into the same TF-IDF format as our existing data. To do this, we'll use the same vectorizer that we used for our original data. By using the fitted vectorizer, we ensure that the user's input is transformed in the same way as our original text data.

In summary, we save the TF-IDF matrix and the fitted vectorizer so that we can use them later to match user input to senator initiatives.
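
The role of the fitted vectorizer can be illustrated with a small sketch. The two-document corpus is hypothetical, and the pickle round trip happens in memory rather than through the `.pkl` files used in this notebook.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["proteccion animal", "reforma fiscal"]  # hypothetical corpus
toy_vectorizer = TfidfVectorizer().fit(toy_docs)

# Round-trip through pickle in memory instead of a file.
restored = pickle.loads(pickle.dumps(toy_vectorizer))

# The restored vectorizer keeps the learned vocabulary and column order.
print(restored.vocabulary_ == toy_vectorizer.vocabulary_)

# New text maps onto the existing columns; words never seen during fit are dropped.
new_vector = restored.transform(["proteccion ambiental"])
print(new_vector.shape)  # same number of columns as the fitted vocabulary
print(new_vector.nnz)    # only "proteccion" is known, so one non-zero entry
```

Fitting a fresh vectorizer on the user's input instead would produce a different vocabulary and column order, so its vectors could not be compared against the saved matrix.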

In [11]:
# Save the TF-IDF matrix and the fitted vectorizer for later use
with open('tfidf_matrix.pkl', 'wb') as f:
    pickle.dump(X, f)

with open('fitted_vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)

## Test user input

### Load the TF-IDF matrix and the fitted vectorizer

In [12]:
with open('tfidf_matrix.pkl', 'rb') as f:
    X = pickle.load(f)

with open('fitted_vectorizer.pkl', 'rb') as f:
    vectorizer = pickle.load(f)

### Function that matches senators

In [13]:
def match_senators(user_input, df, vectorizer, tfidf_matrix=None):
    # Fall back to the notebook-level TF-IDF matrix X when none is passed in
    if tfidf_matrix is None:
        tfidf_matrix = X

    # Preprocess the user's input
    user_input_preprocessed = preprocess_text(user_input)

    # Transform the preprocessed user input into a TF-IDF vector
    user_vector = vectorizer.transform([user_input_preprocessed])

    # Calculate the cosine similarity between the user's vector and each senator's vector
    similarity_scores = cosine_similarity(user_vector, tfidf_matrix)

    # Add the similarity scores to a copy so the caller's dataframe is not modified
    df = df.copy()
    df['similarity_score'] = similarity_scores[0]

    # Sort the dataframe by similarity score, best match first
    df_sorted = df.sort_values('similarity_score', ascending=False)

    # Return the sorted dataframe
    return df_sorted

### Test function

In [14]:
user_input = "Quiero proteccion para los animales"
df_sorted = match_senators(user_input, processed_df, vectorizer)

### Results

In [15]:
df_sorted[['Apellidos', 'Nombre', 'similarity_score']]

| | Apellidos | Nombre | similarity_score |
|---|---|---|---|
| 1 | Rojas Loreto | Estrella | 0.107948 |
| 0 | Botello Montes | José Alfredo | 0.0 |
| 2 | Moya Clemente | Roberto Juan | 0.0 |
