# Bag of Words with N-grams

### Some words only take on meaning when grouped with others, for example `Plaza Italia` and `Control Remoto` (remote control). Two-word units like these are known as bigrams; there are also unigrams, trigrams, and so on.
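As a minimal sketch of the idea, the helper below slides a window of size `n` over a list of tokens. The function name `ngrams` and the sample tokens are illustrative assumptions and are not part of the original notebook.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical, already-tokenized sentence
tokens = ["i", "hate", "this", "movie"]

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(unigrams)  # ['i', 'hate', 'this', 'movie']
print(bigrams)   # ['i hate', 'hate this', 'this movie']
print(trigrams)  # ['i hate this', 'hate this movie']
```

A sentence with k tokens yields k - n + 1 n-grams, so longer windows produce fewer items per sentence, but each item is more specific.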

In [7]:
# Read the reviews, one document per line
with open("review.txt", "r") as file:
    documents = file.read().splitlines()

print(documents)

["I like this movie, it's funny.", 'I hate this movie.', 'This was awesome! I like it.', 'Nice one. I love it.']


## Using Only Unigrams

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Default ngram_range=(1, 1): every feature is a single word (unigram)
count_vectorizer = CountVectorizer()
BoW_ngrams = count_vectorizer.fit_transform(documents)
# get_feature_names_out() requires scikit-learn >= 1.0
feature_names = count_vectorizer.get_feature_names_out()
pd.DataFrame(BoW_ngrams.toarray(), columns=feature_names)

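|   | awesome | funny | hate | it | like | love | movie | nice | one | this | was |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 3 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |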
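Note that `I` and the `s` of `it's` do not appear as columns. `CountVectorizer` lowercases text by default and its default `token_pattern` only keeps tokens of two or more word characters. The sketch below approximates that step with the standard `re` module; the helper name `tokenize` is an assumption made for illustration, and the real vectorizer does more than this (for example, building the vocabulary and counting).

```python
import re

# Default token_pattern documented for CountVectorizer:
# runs of two or more word characters between word boundaries.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Approximate the default preprocessing: lowercase, then extract tokens."""
    return TOKEN_PATTERN.findall(text.lower())

tokens = tokenize("I like this movie, it's funny.")
print(tokens)  # ['like', 'this', 'movie', 'it', 'funny']
```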


## Using Only Bigrams

In [9]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# ngram_range=(2, 2): every feature is a pair of consecutive words (bigram)
count_vectorizer = CountVectorizer(ngram_range=(2, 2))
BoW_ngrams = count_vectorizer.fit_transform(documents)
# get_feature_names_out() requires scikit-learn >= 1.0
feature_names = count_vectorizer.get_feature_names_out()
pd.DataFrame(BoW_ngrams.toarray(), columns=feature_names)

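|   | awesome like | hate this | it funny | like it | like this | love it | movie it | nice one | one love | this movie | this was | was awesome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |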
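Some of these bigrams, such as `awesome like`, `movie it` and `one love`, join words that sit on opposite sides of a punctuation mark, because punctuation is discarded before the n-grams are built. The sketch below reproduces that for the third review and contrasts it with a hypothetical alternative that splits on `.`, `!` and `?` first. Both the `bigrams` helper and the sentence-splitting step are assumptions for illustration; `CountVectorizer` does not split sentences by default.

```python
import re

# Same pattern as CountVectorizer's default token_pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def bigrams(text):
    """Lowercase, tokenize, and return consecutive word pairs."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

review = "This was awesome! I like it."

# Whole review at once: pairs can cross the "!" boundary
whole_document = bigrams(review)

# Hypothetical alternative: split into sentences first, then pair within each
per_sentence = [bg for s in re.split(r"[.!?]", review) for bg in bigrams(s)]

print(whole_document)  # ['this was', 'was awesome', 'awesome like', 'like it']
print(per_sentence)    # ['this was', 'was awesome', 'like it']
```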


## Using Unigrams and Bigrams

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# ngram_range=(1, 2): features include both unigrams and bigrams
count_vectorizer = CountVectorizer(ngram_range=(1, 2))
BoW_ngrams = count_vectorizer.fit_transform(documents)
# get_feature_names_out() requires scikit-learn >= 1.0
feature_names = count_vectorizer.get_feature_names_out()
pd.DataFrame(BoW_ngrams.toarray(), columns=feature_names)

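|   | awesome | awesome like | funny | hate | hate this | it | it funny | like | like it | like this | ... | movie it | nice | nice one | one | one love | this | this movie | this was | was | was awesome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 2 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 |
| 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |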


## Considering compound words is useful, but let's not overdo it...

In [12]:
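from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# ngram_range=(1, 5): features include every n-gram from 1 to 5 words long
count_vectorizer = CountVectorizer(ngram_range=(1, 5))
BoW_ngrams = count_vectorizer.fit_transform(documents)
# get_feature_names_out() requires scikit-learn >= 1.0
feature_names = count_vectorizer.get_feature_names_out()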

<img src='https://candidmanmx.files.wordpress.com/2015/04/20150403-todo-en-exceso-es-malo-excepto-las-vacaciones-candidman.jpg'>

In [13]:
pd.DataFrame(BoW_ngrams.toarray(), columns = feature_names)

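|   | awesome | awesome like | awesome like it | funny | hate | hate this | hate this movie | it | it funny | like | ... | this movie it | this movie it funny | this was | this was awesome | this was awesome like | this was awesome like it | was | was awesome | was awesome like | was awesome like it |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |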
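To put a number on the "not overdoing it" point, the sketch below counts how many distinct n-grams each value of n contributes for these four reviews. It uses only the standard library and approximates the vectorizer's default tokenization; the helper name `distinct_ngrams` is an assumption for illustration, and the reviews are copied from the output of the first cell.

```python
import re

documents = [
    "I like this movie, it's funny.",
    "I hate this movie.",
    "This was awesome! I like it.",
    "Nice one. I love it.",
]

# Same pattern as CountVectorizer's default token_pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def distinct_ngrams(docs, n):
    """Collect the set of distinct n-grams found across all documents."""
    found = set()
    for doc in docs:
        tokens = TOKEN_PATTERN.findall(doc.lower())
        for i in range(len(tokens) - n + 1):
            found.add(" ".join(tokens[i:i + n]))
    return found

# Number of distinct features contributed by each n from 1 to 5
sizes = {n: len(distinct_ngrams(documents, n)) for n in range(1, 6)}
total = sum(sizes.values())

print(sizes)  # {1: 11, 2: 12, 3: 9, 4: 5, 5: 2}
print(total)  # 39
```

With `ngram_range=(1, 5)` the four short reviews end up described by 39 columns, and most of the longer n-grams occur in exactly one review, so they add width to the matrix without adding anything the documents share.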
