# 1-grams
First, we load the documents, one per non-empty line of the text file.

In [23]:
import requests
import spacy
nlp = spacy.load('en_core_web_sm')

url = 'https://gitlab.com/tangibleai/nlpia2/-/raw/main/src/nlpia2/ch03/bias_intro.txt'
response = requests.get(url)

bias_intro = response.text
docs = [nlp(s) for s in bias_intro.split('\n') if s.strip()]

Next, we vectorize the documents into a matrix of word counts (1-grams).

In [25]:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [doc.text for doc in docs]
for doc in docs:
    print(doc)

vectorizer = CountVectorizer()
count_vectors = vectorizer.fit_transform(corpus)

Algorithmic bias describes systematic and repeatable errors in a computer system that create unfair outcomes, such as privileging one arbitrary group of users over others.
Bias can emerge due to many factors, including but not limited to the design of the algorithm or the unintended or unanticipated use or decisions relating to the way data is coded, collected, selected or used to train the algorithm.
Algorithmic bias is found across platforms, including but not limited to search engine results and social media platforms, and can have impacts ranging from inadvertent privacy violations to reinforcing social biases of race, gender, sexuality, and ethnicity.
The study of algorithmic bias is most concerned with algorithms that reflect "systematic and unfair" discrimination.
This bias has only recently been addressed in legal frameworks, such as the 2018 European Union's General Data Protection Regulation.
More comprehensive regulation is needed as emerging technologies become increasingly

Vectorize the question we want to search for, using the vocabulary already fitted on the corpus.

In [7]:
question = 'What is algorithmic bias?'
question_vec = vectorizer.transform([question])
question_vec.toarray()

array([[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
      dtype=int64)

Compute the cosine similarity between the question vector and every document vector in the corpus.

In [21]:
from sklearn.metrics.pairwise import cosine_similarity

similar_docs = cosine_similarity(count_vectors, question_vec)
print(similar_docs)

most_similar_doc_index = similar_docs.argmax()
most_similar_doc = docs[most_similar_doc_index]

print('Question: ', question)
print('Match: ', most_similar_doc)

[[0.23570226]
 [0.12451456]
 [0.24743583]
 [0.4330127 ]
 [0.12909944]
 [0.16012815]
 [0.        ]
 [0.        ]
 [0.1490712 ]
 [0.27216553]
 [0.        ]
 [0.        ]
 [0.24077171]
 [0.14002801]
 [0.        ]
 [0.09128709]]
Question:  What is algorithmic bias?
Match:  The study of algorithmic bias is most concerned with algorithms that reflect "systematic and unfair" discrimination.


Not a bad result, but the first sentence of the corpus is probably a better match, because it actually defines algorithmic bias. Let's see whether n-grams can improve the precision.
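
To see why the fourth sentence outranks the first, the two scores can be recomputed by hand. This is a rough sketch in plain Python rather than the notebook's scikit-learn code: the helpers `bag_of_words` and `cosine` are hypothetical, the regular expression is meant to mirror the default tokenization of `CountVectorizer` (lowercase, tokens of two or more word characters), and question words outside the vocabulary are dropped, as `transform` does.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    # Lowercase and keep tokens of 2+ word characters (CountVectorizer's default).
    return Counter(re.findall(r'\b\w\w+\b', text.lower()))

def cosine(a, b):
    # Dot product over shared terms, divided by the product of the vector norms.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

first = bag_of_words(
    'Algorithmic bias describes systematic and repeatable errors in a '
    'computer system that create unfair outcomes, such as privileging one '
    'arbitrary group of users over others.')
fourth = bag_of_words(
    'The study of algorithmic bias is most concerned with algorithms that '
    'reflect "systematic and unfair" discrimination.')

# Keep only the question words that occur in the vocabulary ("what" does not).
vocabulary = set(first) | set(fourth)
query = Counter({term: count
                 for term, count in bag_of_words('What is algorithmic bias?').items()
                 if term in vocabulary})

score_first = cosine(first, query)    # 2 shared words, 24 distinct words
score_fourth = cosine(fourth, query)  # 3 shared words, 16 distinct words
print(f'{score_first:.4f} {score_fourth:.4f}')  # 0.2357 0.4330
```

Both values agree with the first and fourth rows of the similarity output above. The fourth sentence wins for two reasons: it is shorter, so its norm is smaller, and it also matches the common word "is", which says little about the topic.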

# 2-grams

In [31]:
ngram_vectorizer = CountVectorizer(ngram_range=(1, 2))
ngram_vectors = ngram_vectorizer.fit_transform(corpus)
ngram_vectors

<16x616 sparse matrix of type '<class 'numpy.int64'>'
	with 772 stored elements in Compressed Sparse Row format>

In [32]:
ngram_question_vec = ngram_vectorizer.transform([question])

similar_docs = cosine_similarity(ngram_vectors, ngram_question_vec)

most_similar_doc_index = similar_docs.argmax()
most_similar_doc = docs[most_similar_doc_index]

print('Question: ', question)
print('Match: ', most_similar_doc)

Question:  What is algorithmic bias?
Match:  The study of algorithmic bias is most concerned with algorithms that reflect "systematic and unfair" discrimination.
