<a href="https://colab.research.google.com/github/jroots7/RAI/blob/master/RaioMcQueen.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Unzip the archive and collect the file names**

In [1]:
# importing required modules 
from zipfile import ZipFile 

# specifying the zip file name 
file_name = "Ficheros(html).zip"

# opening the zip file in READ mode 
with ZipFile(file_name, 'r') as zf:  # 'zf' avoids shadowing the built-in zip()
    # extracting the names of the files in the zip
    file_names = zf.namelist()
    # extracting all the files
    print('Extracting all the files now...')
    zf.extractall()
    print('Done!')


Extracting all the files now...
Done!


**Read the files and clean the data**

In [2]:
import re

# Compile the patterns once, outside the loop
script_re = re.compile(r'<script.*?</script>')  # inline scripts and their contents
tag_re = re.compile(r'<.*?>')                   # any remaining HTML tag
space_re = re.compile(r'\s+')                   # runs of whitespace

clean_data = []
print('Cleaning raw data...')
for file_name in file_names:
    with open(file_name, 'r') as file:
        # Drop newlines so the non-greedy patterns can match across lines
        rawdata = file.read().replace('\n', '')
    no_scripts = script_re.sub('', rawdata)
    no_tags = tag_re.sub(' ', no_scripts)
    clean_data.append(space_re.sub(' ', no_tags))
print('Done!')


Cleaning raw data...
Done!
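
The cleaned text still contains HTML entities such as `&nbsp;`, which later show up as `nbsp` tokens in the vocabulary. One possible refinement, sketched here with the standard-library `html.unescape`, is to decode entities before collapsing whitespace. The helper name `clean_html` and the sample string are hypothetical and not part of the notebook. Applying this step would change the vocabulary, and therefore the numbers shown in the later cells.

```python
import html
import re

SCRIPT_RE = re.compile(r'<script.*?</script>', re.DOTALL | re.IGNORECASE)
TAG_RE = re.compile(r'<.*?>', re.DOTALL)
SPACE_RE = re.compile(r'\s+')

def clean_html(raw):
    """Hypothetical helper: strip scripts and tags, decode entities, collapse whitespace."""
    text = SCRIPT_RE.sub('', raw)   # remove scripts together with their contents
    text = TAG_RE.sub(' ', text)    # replace every remaining tag with a space
    text = html.unescape(text)      # '&nbsp;' becomes a non-breaking space, '&amp;' becomes '&'
    return SPACE_RE.sub(' ', text).strip()  # \s also matches the non-breaking space

# Made-up sample, only for illustration
sample = '<p>Home&nbsp;&nbsp;Discussion</p><script>var x = 1;</script><b>Gerard\nSalton</b>'
print(clean_html(sample))
```

Because `re.DOTALL` lets `.` match newlines, this variant does not need the `replace('\n', '')` step, which also avoids gluing together words that were separated only by a line break.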


**Build the corpus from all the documents**


In [3]:
corpus = clean_data
corpus

[' Gerard Salton: Facts, Discussion Forum, and Encyclopedia Article Home &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Discussion &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Topics &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Dictionary &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Almanac Signup &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Login Gerard Salton Gerard Salton Topic Home Discussion Discussion Ask a question about \' Gerard Salton \' Start a new discussion about \' Gerard Salton \' Answer questions from other users Full Discussion Forum &nbsp; Encyclopedia Gerard Salton (8 March, 1927&nbsp;in Nuremberg Nuremberg Nuremberg is a city in the German state of Bavaria, in the administrative region of Middle Franconia. It is situated on the Pegnitz river and the Rhine-Main-Danube Canal and is Franconia\'s largest city. It is located about 170 kilometres north of Munich, at 49.27° N 11.5° E. The population is... &nbsp;- 28 August, 1995) was a Professor of Computer Science Computer science Computer science is the study of the theoretical foundat

**Initialize CountVectorizer and tokenize the corpus**

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
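# Tokenize the corpus and count the occurrences of each term
matriz_tf = vectorizer.fit_transform(corpus)
# Vocabulary learned from the corpus (requires scikit-learn >= 1.0)
vectorizer.get_feature_names_out()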

In [5]:
# Matrix of token occurrence counts (one row per document, one column per token)
matriz_tf.toarray()

array([[0, 4, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 1, 2],
       [0, 0, 1, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 1, 0, 0]])

In [6]:
# Example: how the analyzer tokenizes each document in the corpus
analyze = vectorizer.build_analyzer()
for documento in corpus:
    print(analyze(documento))

['gerard', 'salton', 'facts', 'discussion', 'forum', 'and', 'encyclopedia', 'article', 'home', 'nbsp', 'nbsp', 'nbsp', 'nbsp', 'nbsp', 'discussion', 'nbsp', 'nbsp', 'nbsp', 'nbsp', 'nbsp', 'topics', 'nbsp', 'nbsp', 'nbsp', 'nbsp', 'nbsp', 'dictionary', 'nbsp', 'nbsp', 'nbsp', 'nbsp', 'nbsp', 'almanac', 'signup', 'nbsp', 'nbsp', 'nbsp', 'nbsp', 'nbsp', 'nbsp', 'login', 'gerard', 'salton', 'gerard', 'salton', 'topic', 'home', 'discussion', 'discussion', 'ask', 'question', 'about', 'gerard', 'salton', 'start', 'new', 'discussion', 'about', 'gerard', 'salton', 'answer', 'questions', 'from', 'other', 'users', 'full', 'discussion', 'forum', 'nbsp', 'encyclopedia', 'gerard', 'salton', 'march', '1927', 'nbsp', 'in', 'nuremberg', 'nuremberg', 'nuremberg', 'is', 'city', 'in', 'the', 'german', 'state', 'of', 'bavaria', 'in', 'the', 'administrative', 'region', 'of', 'middle', 'franconia', 'it', 'is', 'situated', 'on', 'the', 'pegnitz', 'river', 'and', 'the', 'rhine', 'main', 'danube', 'canal', 'an
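
The output above reflects the default `CountVectorizer` settings: text is lowercased and only tokens of two or more word characters are kept, which is why the `8` in "8 March, 1927" is missing. A rough standard-library approximation of that behavior, assuming the default `token_pattern` and no stop-word list, could look like this (`simple_analyze` is a hypothetical name, not a scikit-learn function):

```python
import re

# Default CountVectorizer token pattern: words of at least two characters
TOKEN_RE = re.compile(r'(?u)\b\w\w+\b')

def simple_analyze(text):
    """Approximate the default analyzer: lowercase, then extract tokens."""
    return TOKEN_RE.findall(text.lower())

print(simple_analyze("Gerard Salton (8 March, 1927) was a Professor"))
```

Single-character tokens such as `8` and `a` are dropped, and punctuation only acts as a separator.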

In [7]:
# Bigram example (word pairs keep more context than single words)
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), 
                                    token_pattern=r'\b\w+\b', min_df=1)
analyze = bigram_vectorizer.build_analyzer()
analyze('Fucking Jim man!')

['fucking', 'jim', 'man', 'fucking jim', 'jim man']
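
To make the effect of `ngram_range=(1, 2)` concrete, here is a minimal pure-Python sketch of word n-gram generation. It assumes the same `\b\w+\b` token pattern used above; `word_ngrams` is a hypothetical helper, not the scikit-learn implementation.

```python
import re

def word_ngrams(text, min_n=1, max_n=2):
    """Return all word n-grams with min_n <= n <= max_n, shortest first."""
    tokens = re.findall(r'\b\w+\b', text.lower())
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens over the token list
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

print(word_ngrams('video game award'))
# ['video', 'game', 'award', 'video game', 'game award']
```

The vocabulary grows quickly with `max_n`, so bigrams trade a larger, sparser matrix for some sensitivity to word order.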

In [0]:
bmatriz_tf = bigram_vectorizer.fit_transform(corpus)
# Unigram and bigram vocabulary (requires scikit-learn >= 1.0)
bigram_vectorizer.get_feature_names_out()

**Create the queries and tokenize them**

In [8]:
query = [
    "What video game won Spike's best driving game award in 2006?"
]
query

["What video game won Spike's best driving game award in 2006?"]

In [9]:
query_tf = vectorizer.transform(query)
query_tf.toarray()

array([[0, 0, 0, ..., 0, 0, 0]])

**Measure the similarity between the query and the documents**

In [10]:
# Dot-product similarity on raw term frequencies (TF)

num_files = matriz_tf.get_shape()[0]
q = query_tf.toarray().flatten()
scalar_prod_TF = []
for i in range(num_files):
  doc = matriz_tf.getrow(i).toarray().flatten()
  scalar_prod_TF.append(q @ doc )
scalar_prod_TF

[32, 4, 265, 7, 85]

In [11]:
# Cosine similarity on raw term frequencies (TF)

from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity(query_tf, matriz_tf)

array([[0.07960544, 0.01758297, 0.25982465, 0.04103896, 0.13234642]])
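
The two measures rank the documents the same way here, but they behave differently in general: the dot product grows with document length, while the cosine divides by both vector norms. The following sketch uses small made-up count vectors (not taken from the corpus) to illustrate the difference. The helper names `dot` and `cosine` are hypothetical.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

# Made-up term-count vectors, for illustration only
q_vec = [1, 2, 0]
doc_short = [1, 2, 0]    # same term proportions as the query
doc_long = [10, 20, 5]   # much longer document, with one extra term

print(dot(q_vec, doc_short), dot(q_vec, doc_long))
print(cosine(q_vec, doc_short), cosine(q_vec, doc_long))
```

The long document wins on the dot product simply because its counts are larger, whereas the cosine prefers the short document whose term proportions match the query exactly.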

**Initialize TfidfVectorizer (which weights tokens differently) and tokenize the corpus**


In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
matriz_tfidf = tfidf_vectorizer.fit_transform(corpus)
# TF-IDF weight of each token in each document
matriz_tfidf.toarray()

array([[0.        , 0.04673153, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.02424865,
        0.0484973 ],
       [0.        , 0.        , 0.0057586 , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.02683986, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.00777997, 0.        , 0.        , ..., 0.00777997, 0.        ,
        0.        ]])

In [13]:
# IDF weight of each token, computed over the whole corpus
tfidf_vectorizer.idf_

array([2.09861229, 1.69314718, 2.09861229, ..., 2.09861229, 2.09861229,
       2.09861229])
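
These values follow from the smoothed IDF formula that `TfidfVectorizer` uses by default (`smooth_idf=True`): idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing t. The sketch below reproduces the two values printed above, assuming n = 5 (the number of rows in the matrix). The function name `smooth_idf` is only illustrative.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 5  # the corpus in this notebook has five documents
for df in (1, 2, 5):
    print(df, round(smooth_idf(n_docs, df), 8))
```

A token that appears in a single document gets 2.09861229, one that appears in two documents gets 1.69314718, and a token present in all five documents gets the minimum weight of 1.0.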

In [14]:
# Vocabulary as a plain Python list (requires scikit-learn >= 1.0)
tfidf_vectorizer.get_feature_names_out().tolist()

['00',
 '000',
 '06',
 '10',
 '11',
 '12',
 '13',
 '14',
 '15',
 '16',
 '1636',
 '169',
 '170',
 '19',
 '1927',
 '1930',
 '1947',
 '1949',
 '1950',
 '1952',
 '1954',
 '1957',
 '1958',
 '1960s',
 '1965',
 '1975',
 '1983',
 '1989',
 '1995',
 '1996',
 '1997',
 '1999',
 '200',
 '2000',
 '2002',
 '2003',
 '2004',
 '2005',
 '2006',
 '2007',
 '2008',
 '2009',
 '2010',
 '20th',
 '21',
 '22',
 '24',
 '255',
 '27',
 '28',
 '29',
 '30',
 '3082243',
 '3082322',
 '30may09',
 '31st',
 '33',
 '35',
 '360',
 '39',
 '41',
 '42',
 '46',
 '49',
 '77',
 '80',
 '8216',
 '8217',
 '8220',
 '8221',
 '84',
 '92',
 'aback',
 'ability',
 'about',
 'above',
 'absolutely',
 'absorb',
 'academic',
 'academy',
 'accept',
 'accepting',
 'accepts',
 'access',
 'accessories',
 'accidentally',
 'acclaim',
 'account',
 'accounting',
 'accredited',
 'accumulate',
 'accumulating',
 'achievement',
 'acm',
 'acquired',
 'across',
 'activate',
 'activities',
 'activity',
 'actor',
 'actress',
 'actually',
 'adapter',
 'add',


**Transform the query into a vector**

In [15]:
query_tfidf = tfidf_vectorizer.transform(query)
query_tfidf.toarray()

array([[0., 0., 0., ..., 0., 0., 0.]])

**Measure the similarity between the query and the documents**

In [16]:
# Dot-product similarity on TF-IDF weights

num_files = matriz_tfidf.get_shape()[0]

# Compute tf * idf for the query terms (raw weights, without the L2
# normalization that TfidfVectorizer applies by default)
q_tfidf = query_tf.toarray().flatten() * tfidf_vectorizer.idf_.flatten()  

scalar_prod_TFIDF = []
for i in range(num_files):
  doc_tfidf = matriz_tf.getrow(i).toarray().flatten() * tfidf_vectorizer.idf_.flatten()
  scalar_prod_TFIDF.append(q_tfidf @ doc_tfidf )
scalar_prod_TFIDF

[37.672429377866834,
 4.0,
 674.5467239631978,
 13.150606405687153,
 163.48554785416889]

In [17]:
# Cosine similarity on TF-IDF weights

cosine_similarity(query_tfidf, matriz_tfidf)

array([[0.04379845, 0.00778745, 0.31187222, 0.03512459, 0.10211862]])
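
To turn these scores into a retrieval result, the documents can be sorted by decreasing similarity. A minimal sketch, using the cosine values printed above copied into a plain list:

```python
# TF-IDF cosine scores copied from the output above (one per document)
scores = [0.04379845, 0.00778745, 0.31187222, 0.03512459, 0.10211862]

# Document indices ordered from most to least similar to the query
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

for position, doc_index in enumerate(ranking, start=1):
    print(position, doc_index, scores[doc_index])
```

The document at index 2 ranks first, as it does under the other three measures. Since `clean_data` was built by iterating over `file_names` in order, `file_names[doc_index]` should give the matching file, assuming the archive contains only files and no directory entries.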