# TEXT VECTORIZATION
__Solution made by: Ayoub Nainia__
* Note: This notebook has been deployed from `Colab` as a web app with `Streamlit` via `Ngrok`. 
* Currently working on solving dependency issues to deploy it independently with `Streamlit Sharing`.

In [None]:
# !pip install -U textblob
# !python -m textblob.download_corpora


In [None]:
import pandas as pd
import numpy as np
import spacy
from textblob import TextBlob
import nltk
import re
import math

# Exercise 1

In [None]:
documents = [
    "You are trying to code TF-IDF all by yourself like a big girl/boy.",
    "So this is a tinny doc.",
    "And another tinny doc to test few stuff.",
    "So in total, we are four documents, have fun ;)."
]

1. Write a function TF that takes a word w and a document d and computes TF(w,d).

In [None]:
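def TF(w, d):
  # lowercase both the document and the query word
  d = d.lower()
  w = w.lower()

  # replace everything that is not a-z with a space, then tokenize
  tkn = re.sub('[^a-z]+', ' ', d).split()
  # NOTE: str.count matches substrings, so "you" is also counted inside "yourself"
  occ = d.count(w)
  return occ/len(tkn)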

In [None]:
print(TF('boy', documents[0]))

0.06666666666666667
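
`d.count(w)` in `TF` counts substring matches rather than whole tokens. A minimal sketch of a token-based variant is given here for comparison; `tokenize` and `tf_tokens` are hypothetical names that are not part of the original solution, and the tokenizer is the same regex split used in `TF`.

```python
import re

def tokenize(text):
    """Lowercase the text and split it on anything that is not a-z."""
    return re.sub('[^a-z]+', ' ', text.lower()).split()

def tf_tokens(w, d):
    """Term frequency of w in d, counting whole tokens only."""
    tokens = tokenize(d)
    if not tokens:
        return 0.0
    return tokens.count(w.lower()) / len(tokens)

doc = "You are trying to code TF-IDF all by yourself like a big girl/boy."
print(tf_tokens('boy', doc))   # one match among 15 tokens
print(tf_tokens('you', doc))   # 'yourself' is no longer counted as a match
```

With this variant, "boy" and "you" get the same term frequency in the first document, because each appears exactly once as a token.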


2. Write a function IDF that takes a word w and a collection of documents D and computes IDF(w,D).

In [None]:
def IDF(w, D):
  # to lowercase
  w = w.lower()

  # Number of all docs
  nDocs = len(D)

  # Number of docs containing w (substring match on the cleaned, lowercased text)
  nbDocW = sum(1 for doc in D if w in re.sub('[^a-z]+', ' ', doc.lower()))

  # log base 10 of N / df; 0 when no document contains w
  if nbDocW != 0:
    return math.log((nDocs / nbDocW), 10)
  else:
    return 0


In [None]:
print(IDF('boy', documents))

0.6020599913279623
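
The document count in `IDF` is also based on substring matching, so a short word such as "to" is found inside "total". The sketch below shows a whole-token alternative; `idf_tokens` is a hypothetical helper, and the log base (10) and the zero fallback simply mirror the function above.

```python
import math
import re

def idf_tokens(w, docs):
    """log10(N / df), where df counts documents containing w as a whole token."""
    w = w.lower()
    df = sum(1 for doc in docs
             if w in re.sub('[^a-z]+', ' ', doc.lower()).split())
    return math.log10(len(docs) / df) if df else 0.0

docs = [
    "You are trying to code TF-IDF all by yourself like a big girl/boy.",
    "So this is a tinny doc.",
    "And another tinny doc to test few stuff.",
    "So in total, we are four documents, have fun ;).",
]
print(idf_tokens('to', docs))    # df = 2: "total" is not counted
print(idf_tokens('boy', docs))   # df = 1
```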


3. Write a function TF-IDF that takes a word w, a document d and a collection D as input and computes TF-IDF(w,d,D). Use TextBlob for tokenization.

In [None]:
def TF_IDF(w, d, C):
  
  tf = TF(w, d)
  idf = IDF(w, C)

  return tf * idf

4. What are the TF, IDF and TF-IDF values for the word "boy" in document 1?

In [None]:
mot = "boy"
print("TF: ", TF(mot, documents[0]))
print("IDF: ", IDF(mot, documents))
print("TF-IDF: ", TF_IDF(mot, documents[0], documents))

TF:  0.06666666666666667
IDF:  0.6020599913279623
TF-IDF:  0.040137332755197486
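
These three values can be checked by hand: the first document has 15 tokens, "boy" occurs once, and 1 of the 4 documents contains it. The short sketch below redoes that arithmetic with the same base-10 logarithm.

```python
import math

tf = 1 / 15                 # one occurrence among 15 tokens
idf = math.log(4 / 1, 10)   # 4 documents, 1 of them contains "boy"
tf_idf = tf * idf

print(round(tf, 6), round(idf, 6), round(tf_idf, 6))
```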


5. Build and display the term-document matrix for the 4 documents, using the TF-IDF(w,d,D) function for the weights and the words as features. The matrix must not contain any "NaN" values. Convert the resulting matrix to document-term format.

In [None]:
# Getting unique tokens from the documents collection
def get_words(documents):
  collection = [re.sub('[^a-z]+', ' ', i.lower()) for i in documents]
  words = []
  for doc in collection:
    for word in doc.split():
      if word not in words and word not in ['the', 'a' ,'an']:
        words.append(word)

  return words

In [None]:
def term_document_mat(documents):
  documents = [re.sub('[^a-z]+', ' ', i.lower()) for i in documents]
  words = get_words(documents)
  cols_TFIDF = []
  for i in documents:
    cols_TFIDF.append([TF_IDF(k, i, documents) for k in words])

  return cols_TFIDF


In [None]:
# initializing the column headers (words) and TF-IDF values (weight)
doc_words = get_words(documents)
mat_tfidf = term_document_mat(documents)

In [None]:
# Create dataframe
df = pd.DataFrame(mat_tfidf)
# Add data frame header
df.columns = doc_words
# transpose of the dataframe
df = df.transpose()
# add documents headers
df.columns = ['Document 1', 'Document 2', 'Document 3', 'Document 4']
df

Term,Document 1,Document 2,Document 3,Document 4
you,0.080275,0.0,0.0,0.0
are,0.020069,0.0,0.0,0.033448
trying,0.040137,0.0,0.0,0.0
to,0.008329,0.0,0.015617,0.013882
code,0.040137,0.0,0.0,0.0
tf,0.040137,0.0,0.0,0.0
idf,0.040137,0.0,0.0,0.0
all,0.040137,0.0,0.0,0.0
by,0.040137,0.0,0.0,0.0
yourself,0.040137,0.0,0.0,0.0
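
The table above is in term-document layout (one row per term). Converting to document-term layout is a transpose; the plain-Python sketch below illustrates this on two rows copied from the table, with `term_document` and `document_term` as illustrative names.

```python
terms = ['are', 'trying']

# Term-document layout: one row per term, one column per document.
term_document = [
    [0.020069, 0.0, 0.0, 0.033448],   # 'are'
    [0.040137, 0.0, 0.0, 0.0],        # 'trying'
]

# Transposing swaps the axes: one row per document, one column per term.
document_term = [list(column) for column in zip(*term_document)]

names = ['Document 1', 'Document 2', 'Document 3', 'Document 4']
for name, row in zip(names, document_term):
    print(name, dict(zip(terms, row)))
```

On a pandas DataFrame, the same conversion is a call to `transpose()`.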


6. Build and display the term-document matrix for the same documents, this time using TfidfVectorizer from sklearn.feature_extraction.text.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
tdf = vectorizer.fit_transform(documents)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
finalDF = pd.DataFrame(tdf.toarray(), columns = vectorizer.get_feature_names_out())
finalDF = finalDF.transpose()
finalDF.columns = ['Document 1', 'Document 2', 'Document 3', 'Document 4']
finalDF

Term,Document 1,Document 2,Document 3,Document 4
all,0.274792,0.0,0.0,0.0
and,0.0,0.0,0.381669,0.0
another,0.0,0.0,0.381669,0.0
are,0.216649,0.0,0.0,0.274603
big,0.274792,0.0,0.0,0.0
boy,0.274792,0.0,0.0,0.0
by,0.274792,0.0,0.0,0.0
code,0.274792,0.0,0.0,0.0
doc,0.0,0.401043,0.300912,0.0
documents,0.0,0.0,0.0,0.348299
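
These weights differ from the hand-written TF-IDF because `TfidfVectorizer`, with its default settings, uses raw counts for TF, a smoothed natural-log IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each document vector, and it keeps only tokens of two or more characters. The sketch below reproduces that recipe in plain Python as an approximation; `sk_tokens` and `sklearn_like_tfidf` are hypothetical names, and the regex only approximates the default token pattern.

```python
import math
import re
from collections import Counter

docs = [
    "You are trying to code TF-IDF all by yourself like a big girl/boy.",
    "So this is a tinny doc.",
    "And another tinny doc to test few stuff.",
    "So in total, we are four documents, have fun ;).",
]

def sk_tokens(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def sklearn_like_tfidf(docs):
    counts_per_doc = [Counter(sk_tokens(d)) for d in docs]
    n = len(docs)
    # Document frequency: number of documents each term appears in.
    df = Counter(term for counts in counts_per_doc for term in counts)
    rows = []
    for counts in counts_per_doc:
        weights = {t: c * (math.log((1 + n) / (1 + df[t])) + 1)
                   for t, c in counts.items()}
        # L2-normalize the document vector (guard against empty documents).
        norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
        rows.append({t: v / norm for t, v in weights.items()})
    return rows

rows = sklearn_like_tfidf(docs)
print(round(rows[0]['boy'], 6))
print(round(rows[3]['documents'], 6))
```

The two printed values can be compared with the "boy" and "documents" rows of the table above.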


8. Import the shakespeare corpus from NLTK and build the term-document matrix using
TF-IDF for the weights and bigrams for the features.

In [None]:
import nltk
# Only the Gutenberg corpus is needed here; nltk.download('all') fetches every NLTK package
nltk.download('gutenberg')

True

In [None]:
import nltk

shakes = [
    nltk.corpus.gutenberg.raw('shakespeare-caesar.txt'),
    nltk.corpus.gutenberg.raw('shakespeare-hamlet.txt'),
    nltk.corpus.gutenberg.raw('shakespeare-macbeth.txt')
]

In [None]:
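# Clean data: lowercase each play and replace non-letters with a space
def clean_shakes(shakesData):
  return [re.sub('[^a-z]+', ' ', d.lower()) for d in shakesData]

cleaned = clean_shakes(shakes)
# Preview the start of the first play instead of printing all three in full
print(cleaned[0][:200])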



In [None]:
# term-document matrix
from sklearn.feature_extraction.text import TfidfVectorizer

# ngram_range=(2,2) keeps bigrams only (no single words)
vectorizer = TfidfVectorizer(ngram_range=(2,2))
tdfShakes = vectorizer.fit_transform(cleaned)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
dfShakes = pd.DataFrame(tdfShakes.toarray(), columns = vectorizer.get_feature_names_out())
dfShakes = dfShakes.transpose()
dfShakes.columns = ['shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt']
dfShakes

Bigram,shakespeare-caesar.txt,shakespeare-hamlet.txt,shakespeare-macbeth.txt
abhominably play,0.000000,0.003801,0.000000
abhorred my,0.000000,0.003801,0.000000
abhorred tyrant,0.000000,0.000000,0.006533
abide it,0.005466,0.000000,0.000000
abide no,0.000000,0.000000,0.006533
...,...,...,...
youth you,0.000000,0.003801,0.000000
youthfull season,0.005466,0.000000,0.000000
youths and,0.005466,0.000000,0.000000
youths that,0.000000,0.000000,0.006533
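
Each row label above is a bigram, that is, a pair of adjacent tokens. As a rough illustration of what `ngram_range=(2,2)` produces, the sketch below builds word bigrams by hand; `word_bigrams` is a hypothetical helper that splits on whitespace only, whereas the vectorizer also drops one-character tokens before pairing.

```python
def word_bigrams(text):
    """Return the adjacent word pairs of a text as space-joined strings."""
    tokens = text.lower().split()
    return [' '.join(pair) for pair in zip(tokens, tokens[1:])]

print(word_bigrams("abide it no more"))
```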


# Exercise 2

1. Import spaCy and load the en_core_web_sm model with spacy.load.

In [None]:
import spacy 
en_model = spacy.load("en_core_web_sm")

2. Take the first paragraph of URL 1 as the document. Display the lemma, the POS and the dependency tag for each token of this document.

In [None]:
text = "Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of \"understanding\" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves."
nlpDocument = en_model(text)

In [None]:
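# Collect the token text, lemma, coarse POS and tag of each token
tokens = []
lemma = []
pos = []
tag = []
for token in nlpDocument:
  tokens.append(token.text)
  lemma.append(token.lemma_)
  pos.append(token.pos_)
  # NOTE: token.tag_ is the fine-grained POS tag (JJ, NN, ...);
  # the dependency label itself is available as token.dep_
  tag.append(token.tag_)

data = {"Token": tokens, "Lemma": lemma, "POS": pos, "Dependency Tag": tag}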
df = pd.DataFrame(data)
df

Index,Token,Lemma,POS,Dependency Tag
0,Natural,natural,ADJ,JJ
1,language,language,NOUN,NN
2,processing,processing,NOUN,NN
3,(,(,PUNCT,-LRB-
4,NLP,NLP,PROPN,NNP
...,...,...,...,...
88,organize,organize,VERB,VB
89,the,the,DET,DT
90,documents,document,NOUN,NNS
91,themselves,-PRON-,PRON,PRP
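
The "Dependency Tag" column above holds values such as JJ and NN, which are fine-grained POS tags (`token.tag_`); spaCy exposes the dependency label separately as `token.dep_`. The sketch below shows one way to collect both. `token_rows` is a hypothetical helper, and the stand-in tokens are hand-written so the example runs without a model; their attribute values are illustrative, not real parser output.

```python
from types import SimpleNamespace

def token_rows(doc):
    """One row per token: text, lemma, coarse POS, fine-grained tag, dependency label."""
    return [
        {"Token": t.text, "Lemma": t.lemma_, "POS": t.pos_,
         "Tag": t.tag_, "Dependency": t.dep_}
        for t in doc
    ]

# Stand-in tokens exposing the same attribute names as spaCy tokens.
fake_doc = [
    SimpleNamespace(text="Natural", lemma_="natural", pos_="ADJ", tag_="JJ", dep_="amod"),
    SimpleNamespace(text="language", lemma_="language", pos_="NOUN", tag_="NN", dep_="compound"),
]

for row in token_rows(fake_doc):
    print(row)
```

In the notebook itself, `pd.DataFrame(token_rows(nlpDocument))` would build the same kind of table from the parsed paragraph.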


3. Using from spacy import displacy, display the dependency tree of the document.

In [None]:
# !pip install spacy

In [None]:
from spacy import displacy

# Render the dependency tree of the paragraph parsed above
displacy.render(nlpDocument, style='dep', jupyter=True, options={'distance': 90})