# LOGEO 

### Exploring the top 5,000 words from hundreds of Spanish-speaking cities

In this notebook, we are going to explore a Spanish-language dataset: we will map its cities, turn its word lists into vectors, and run similarity queries against them.

When [Google Dataset Search](https://toolbox.google.com/datasetsearch/search?query=spanish%20twitter&docid=KjIKji0L9hTF2vQAAAAAAA%3D%3D) came out, I started looking for datasets in Spanish, to practice programming and data analysis. I found a Twitter dataset and started working with it. 

This dataset is amazing. It contains the 5,000 most frequent Spanish words on Twitter for hundreds of cities in the Spanish-speaking world!

It was created by the [Instituto Caro y Cuervo](https://www.caroycuervo.gov.co/), a linguistics institute in Colombia, and made available on the [Open Data website](https://www.datos.gov.co/) of the Colombian government.

According to them, 

> More than *250 million tweets* in Spanish from 331 Spanish-speaking cities in Latin America, Spain and the United States were compiled from Twitter. The reported data correspond to the years 2009 to 2016.



Ok, let's get to work. First, let's download the dataset at https://www.datos.gov.co/Ciencia-Tecnolog-a-e-Innovaci-n/The-top-5000-frequent-Spanish-words-in-Twitter-for/nmid-inr9. 

The dataset is a .csv file. We are going to read it using [Pandas](https://pandas.pydata.org/), a great Python library for data analysis. 

In [1]:
import pandas as pd

df = pd.read_csv('The_top-5000_frequent_Spanish_words_in_Twitter_for_331_cities_in_the_Spanish-speaking_world.csv')

One of the first things we can do is call `head()` to see the first rows. Let's inspect the dataset.

In [2]:
df.head()

Unnamed: 0,GLOBAL:WORD,GLOBAL:FREQUENCY,Argentina:Bahia_Blanca:WORD,Argentina:Bahia_Blanca:FREQUENCY,Argentina:Buenos_Aires:WORD,Argentina:Buenos_Aires:FREQUENCY,Argentina:Catamarca:WORD,Argentina:Catamarca:FREQUENCY,Argentina:Ciudad_La_Rioja:WORD,Argentina:Ciudad_La_Rioja:FREQUENCY,...,USA:New_York-NY:FREQUENCY,USA:Oakland-CA:WORD,USA:Oakland-CA:FREQUENCY,USA:Orlando-FL:WORD,USA:Orlando-FL:FREQUENCY,USA:Philadelphia-PA:WORD,USA:Philadelphia-PA:FREQUENCY,USA:Phoenix-AZ:WORD,USA:Phoenix-AZ:FREQUENCY,USA:Plano-TX:WORD
0,de,84608401,que,345119,de,1724671,que,156903,de,46324,...,238897,de,55113,de,119361,de,26833,de,33111,de
1,que,69167058,a,294770,que,1502363,de,109218,la,43095,...,182764,que,37072,que,104045,que,24060,que,29997,que
2,la,55029848,me,294326,la,1184240,me,95856,que,42660,...,163905,la,36490,a,82067,la,21562,la,23814,to
3,a,51786053,de,284870,a,1160261,a,94700,a,31749,...,145383,el,35430,la,79492,a,19689,a,22846,replying
4,y,47834089,la,265392,y,1021192,la,86979,y,27054,...,145313,a,33387,y,74993,to,18777,to,22321,a


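Here, we see that the columns come in pairs, one pair per city: a `WORD` column with the city's words and a `FREQUENCY` column with the frequency of each of those words, listed from most to least frequent. The `GLOBAL` pair holds the same ranking for the whole corpus.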
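
The column names follow a `Country:City:WORD` / `Country:City:FREQUENCY` pattern, as the header above shows. Here is a minimal sketch of how such a name could be split into its parts. `parse_column` is a hypothetical helper written for this illustration; it is not part of the dataset or of Pandas.

```python
def parse_column(name):
    """Split 'Country:City:KIND' into (country, city, kind), or return None."""
    parts = name.split(':')
    if len(parts) != 3:
        # Columns such as 'GLOBAL:WORD' or the unnamed index do not describe a city.
        return None
    country, city, kind = parts
    # City names use underscores instead of spaces in the header.
    return country, city.replace('_', ' '), kind


print(parse_column('Argentina:Bahia_Blanca:WORD'))
print(parse_column('GLOBAL:FREQUENCY'))
```

Applied to every name in `df.columns`, a helper like this would give us the list of countries and cities without reading any of the rows.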



By querying the shape, we see that the dataset has 5,000 rows and 585 columns.

In [3]:
df.shape

(5000, 585)

Although this is a dataset about cities, it does not contain geographical information, such as the location of the cities. 

By consulting other datasets and by manually looking up the latitude and longitude of the cities, I enriched the dataset with geographical information. You can download the enhanced dataset [here](https://drive.google.com/file/d/1gHd6N12E1YqS4GMG8HbLFYQsUfGVVXpq/view?usp=sharing).

In [4]:
df = pd.read_csv('The_top-5000_frequent_Spanish_words_in_Twitter_for_331_cities_in_the_Spanish-speaking_world_2.csv')

In [5]:
df.head()

Unnamed: 0.1,Unnamed: 0,Country,city_ascii,lat,lng,lines,freqs,ids,freq_sum
0,0,Argentina,Bahia Blanca,-38.74,-62.265,"que,a,me,de,la,y,no,el,en,se,con,mi,un,lo,es,t...","345119,294770,294326,284870,265392,214157,2053...","3838,0,2854,1196,2485,4952,3153,1467,1503,4170...",8656300
1,1,Argentina,Buenos Aires,-34.6025,-58.3975,"de,que,la,a,y,me,el,no,en,con,es,se,mi,un,lo,t...","1724671,1502363,1184240,1160261,1021192,956115...","1196,3838,2485,0,4952,2854,1467,3153,1503,994,...",41130431
2,2,Argentina,Catamarca,-28.47,-65.78,"que,de,me,a,la,no,y,el,en,mi,te,se,con,lo,es,u...","156903,109218,95856,94700,86979,84872,68712,67...","3838,1196,2854,0,2485,3153,4952,1467,1503,2917...",3273857
3,3,Argentina,Ciudadlarioja,-34.630719,-68.295192,"de,la,que,a,y,el,no,en,me,es,con,se,un,twitter...","46324,43095,42660,31749,27054,26124,24286,2137...","1196,2485,3838,0,4952,1467,3153,1503,2854,1597...",1124827
4,4,Argentina,Ciudaddecorrientes,-25.277435,-57.573031,"que,de,a,me,la,no,y,el,en,mi,te,se,con,es,lo,u...","419328,306630,263724,241012,233457,233107,2167...","3838,1196,0,2854,2485,3153,4952,1467,1503,2917...",8978312
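
The enhanced file has one row per city, with the words and frequencies packed into the comma-joined `lines` and `freqs` columns. The notebook does not show how these rows were built, so the following is only a sketch of one way to derive them from the wide table. The `wide` dictionary is a toy stand-in using a few values from the first output, `to_city_rows` is a hypothetical helper, and treating `freq_sum` as the sum of a city's frequencies is an assumption.

```python
# Toy stand-in for the wide table: one WORD and one FREQUENCY column per city.
wide = {
    'Argentina:Bahia_Blanca:WORD': ['que', 'a', 'me'],
    'Argentina:Bahia_Blanca:FREQUENCY': [345119, 294770, 294326],
    'Argentina:Buenos_Aires:WORD': ['de', 'que', 'la'],
    'Argentina:Buenos_Aires:FREQUENCY': [1724671, 1502363, 1184240],
}


def to_city_rows(wide):
    """Build one record per city from paired WORD / FREQUENCY columns."""
    rows = []
    for column, words in wide.items():
        parts = column.split(':')
        if len(parts) != 3 or parts[2] != 'WORD':
            continue  # skip FREQUENCY columns and non-city columns
        country, city, _ = parts
        freqs = wide[f'{country}:{city}:FREQUENCY']
        rows.append({
            'Country': country,
            'city': city.replace('_', ' '),
            'lines': ','.join(words),
            'freqs': ','.join(str(f) for f in freqs),
            'freq_sum': sum(freqs),  # assumption: total of the city's frequencies
        })
    return rows


rows = to_city_rows(wide)
for row in rows:
    print(row)
```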


How many countries and cities are included in the dataset? Let's find out:

In [6]:
from collections import Counter

# Count how many cities (rows) each country has
cnt = Counter(df['Country'])

print(cnt)

Counter({'Mexico': 74, 'Spain': 36, 'Colombia': 31, 'Argentina': 26, 'Chile': 24, 'USA': 24, 'Peru': 14, 'Ecuador': 10, 'Bolivia': 7, 'Guatemala': 7, 'Honduras': 7, 'Paraguay': 6, 'Panama': 5, 'Republica_Dominicana': 5, 'Costa_Rica': 4, 'Nicaragua': 4, 'El_Salvador': 3, 'Puerto_Rico': 3, 'Cuba': 1})


Great! We have 19 countries, and we can see how many cities are included for each one. Mexico has the greatest number of cities in this dataset, with 74, while Cuba has just one. Let's keep this in mind and move on.

Let's see all the cities included in the dataset:

In [7]:
for country, city in zip(df['Country'], df['city_ascii']):
    print(country + ', ' + city)

Argentina, Bahia Blanca
Argentina, Buenos Aires
Argentina, Catamarca
Argentina, Ciudadlarioja
Argentina, Ciudaddecorrientes
Argentina, Ciudad De Neuquen
Argentina, Ciudad De Salta
Argentina, Comodoro Rivadavia
Argentina, Concordia
Argentina, Cordoba
Argentina, Formosa
Argentina, Jujuy
Argentina, La Plata
Argentina, Mar Del Plata
Argentina, Parana
Argentina, Posadas
Argentina, Rio Cuarto
Argentina, Rosario
Argentina, San Juan
Argentina, San Luis
Argentina, San Miguel De Tucuman
Argentina, San Rafael
Argentina, Santa Rosa
Argentina, Santiago Del Estero
Argentina, Villa Nueva
Bolivia, Cochabamba
Bolivia, La Paz
Bolivia, Montero
Bolivia, Oruro
Bolivia, Santa Cruz
Bolivia, Sucre
Bolivia, Tarija
Chile, Antofagasta
Chile, Arica
Chile, Calama
Chile, Chillan
Chile, Concepcion
Chile, Copiapo
Chile, Curico
Chile, Iquique
Chile, La Serena
Chile, Linares
Chile, Los Andes
Chile, Los Angeles
Chile, Osorno
Chile, Ovalle
Chile, Puerto Mont
Chile, Punta Arenas
Chile, Quillota
Chile, Rancagua
Chile, San 

At this point, we could also visualize the cities in the dataset on a map. We will use [Folium](https://github.com/python-visualization/folium), a flexible Python library that allows us to make maps easily.

In [8]:
import folium

m = folium.Map(tiles='cartodbpositron')
m

Let's add information to our map. We are going to zip three columns from our dataframe: `city_ascii` (the city name), `lat` and `lng` (the latitude and longitude of each city).

In [9]:
cities = list(zip(df['city_ascii'],df['lat'],df['lng']))

Let's see the first element:

In [10]:
cities[0]

('Bahia Blanca', -38.74, -62.265)

Now, we can add all the cities to our map. We pass the latitude and longitude (`city[1]` and `city[2]`) to `location`. We also pass the city name (`city[0]`) to `popup`, so that the name is displayed when the marker is clicked.

In [11]:
for city in cities:
    folium.Marker(
        location=[city[1], city[2]],
        popup=city[0]
    ).add_to(m)
m

Nice! We get a glance at all the cities included in our dataset! Click on a marker to see the city's name.

Spain and its islands are included, along with almost all of the Latin American countries. We even have Spanish-speaking cities from the United States. The dataset does not include any cities from Venezuela or Uruguay, though.

# Let's dive into text processing

Let's put aside the map for now and let's analyze the words. We are going to use [Gensim](https://radimrehurek.com/gensim/), a Python library for Natural Language Processing.

In order to analyze the words, we need to convert them to vectors. 

In [12]:
df['lines']

0      que,a,me,de,la,y,no,el,en,se,con,mi,un,lo,es,t...
1      de,que,la,a,y,me,el,no,en,con,es,se,mi,un,lo,t...
2      que,de,me,a,la,no,y,el,en,mi,te,se,con,lo,es,u...
3      de,la,que,a,y,el,no,en,me,es,con,se,un,twitter...
4      que,de,a,me,la,no,y,el,en,mi,te,se,con,es,lo,u...
5      que,de,a,me,la,no,y,el,en,mi,se,con,es,te,lo,u...
6      que,de,la,a,y,no,me,el,en,te,se,mi,es,un,con,l...
7      que,de,a,me,la,no,y,el,en,mi,se,con,te,lo,es,u...
8      que,de,me,a,la,no,y,el,en,se,con,te,lo,es,un,m...
9      que,de,me,a,la,no,y,el,en,se,mi,con,es,lo,te,u...
10     que,de,a,me,no,la,y,el,te,en,mi,se,es,con,lo,u...
11     que,de,la,a,no,me,y,el,en,te,mi,es,se,un,lo,co...
12     que,de,a,la,me,y,el,no,en,se,con,mi,lo,un,es,t...
13     que,de,a,me,la,y,no,el,en,con,se,mi,un,lo,del,...
14     que,de,a,me,la,no,y,el,en,se,mi,con,lo,te,un,e...
15     que,de,a,la,me,y,no,el,en,mi,con,se,te,encarna...
16     que,de,a,me,la,no,y,el,con,en,te,se,mi,lo,un,e...
17     que,de,a,la,me,no,y,el,e

As the output shows, each row holds all of a city's words in a single string, joined by commas (,). Let's create a new column that splits each of these `lines` into tokens.

In [13]:
from gensim import corpora
df['tokens'] = df['lines'].str.split(',')
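
The `freqs` column is comma-joined in the same way, and the sample rows suggest that its entries line up position by position with `lines`. Under that assumption, here is a small sketch that pairs each word with its frequency, using the first five values of the Bahia Blanca row shown earlier. The variable names are only for illustration.

```python
# First five entries of the Bahia Blanca row, copied from the output above.
lines = 'que,a,me,de,la'
freqs = '345119,294770,294326,284870,265392'

words = lines.split(',')
counts = [int(f) for f in freqs.split(',')]

# Pair each word with its frequency (assumes both lists have the same order).
word_freq = dict(zip(words, counts))

# Relative share of each word within this small sample.
total = sum(counts)
share = {word: count / total for word, count in word_freq.items()}

print(word_freq)
print(share)
```

We do not need the frequencies for the bag-of-words step below, but they would be handy for any analysis that weights words by how often a city uses them.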



In [14]:
df['tokens']

0      [que, a, me, de, la, y, no, el, en, se, con, m...
1      [de, que, la, a, y, me, el, no, en, con, es, s...
2      [que, de, me, a, la, no, y, el, en, mi, te, se...
3      [de, la, que, a, y, el, no, en, me, es, con, s...
4      [que, de, a, me, la, no, y, el, en, mi, te, se...
5      [que, de, a, me, la, no, y, el, en, mi, se, co...
6      [que, de, la, a, y, no, me, el, en, te, se, mi...
7      [que, de, a, me, la, no, y, el, en, mi, se, co...
8      [que, de, me, a, la, no, y, el, en, se, con, t...
9      [que, de, me, a, la, no, y, el, en, se, mi, co...
10     [que, de, a, me, no, la, y, el, te, en, mi, se...
11     [que, de, la, a, no, me, y, el, en, te, mi, es...
12     [que, de, a, la, me, y, el, no, en, se, con, m...
13     [que, de, a, me, la, y, no, el, en, con, se, m...
14     [que, de, a, me, la, no, y, el, en, se, mi, co...
15     [que, de, a, la, me, y, no, el, en, mi, con, s...
16     [que, de, a, me, la, no, y, el, con, en, te, s...
17     [que, de, a, la, me, no,

Let's create a dictionary, with all the *unique* tokens. 

In [15]:
dictionary = corpora.Dictionary(df['tokens'])

By calling `print` on the dictionary, we see that we have 62314 unique tokens:

In [16]:
print(dictionary)

Dictionary(62314 unique tokens: ['a', 'aa', 'aaa', 'aaaa', 'aaaaa']...)


Now, let's actually convert the tokens to vectors. Gensim's documentation describes the function we are going to use:

> The function doc2bow() simply counts the number of occurrences of each distinct word, converts the word to its integer word id and returns the result as a sparse vector.

In [17]:
df['bow'] = df['tokens'].apply(lambda x: dictionary.doc2bow(x))
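
To see what is going on under the hood, here is a pure-Python sketch of the same idea: give every unique token an integer id, then count the tokens of a document. This is an illustration only, not Gensim's implementation; `build_vocab` and `to_bow` are hypothetical helpers, and Gensim may assign ids in a different order.

```python
from collections import Counter


def build_vocab(docs):
    """Give every unique token an integer id (alphabetical order, for illustration)."""
    unique = sorted({token for doc in docs for token in doc})
    return {token: i for i, token in enumerate(unique)}


def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs, ignoring tokens missing from vocab."""
    counts = Counter(token for token in doc if token in vocab)
    return sorted((vocab[token], n) for token, n in counts.items())


docs = [['que', 'a', 'me'], ['de', 'que', 'la']]
vocab = build_vocab(docs)
bow = to_bow(['que', 'que', 'la', 'sabor'], vocab)

print(vocab)
print(bow)
```

Each city's list contains every word once, which is why all the counts in the `bow` column are 1. Counts greater than 1 only show up for free text, such as the query we build later.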

In [18]:
df['bow']

0      [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1...
1      [(0, 1), (1, 1), (2, 1), (3, 1), (9, 1), (14, ...
2      [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1...
3      [(0, 1), (1, 1), (2, 1), (7, 1), (9, 1), (10, ...
4      [(0, 1), (1, 1), (2, 1), (9, 1), (10, 1), (11,...
5      [(0, 1), (1, 1), (2, 1), (6, 1), (7, 1), (8, 1...
6      [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (7, 1...
7      [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (7, 1...
8      [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (6, 1...
9      [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (7, 1...
10     [(0, 1), (2, 1), (3, 1), (4, 1), (6, 1), (7, 1...
11     [(0, 1), (1, 1), (2, 1), (3, 1), (9, 1), (10, ...
12     [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (7, 1...
13     [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1...
14     [(0, 1), (1, 1), (2, 1), (3, 1), (9, 1), (10, ...
15     [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (7, 1...
16     [(0, 1), (1, 1), (2, 1), (3, 1), (7, 1), (8, 1...
17     [(0, 1), (1, 1), (2, 1),

Now, let's transform the dataset from one vector representation into another.

According to Gensim's "Topics and Transformations" tutorial:

> This process serves two goals:
>
> 1. To bring out hidden structure in the corpus, discover relationships between words and use them to describe the documents in a new and (hopefully) more semantic way.
> 2. To make the document representation more compact. This both improves efficiency (new representation consumes less resources) and efficacy (marginal data trends are ignored, noise-reduction).



# TFIDF

What is tf-idf? 

> In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.[1] It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. Tf–idf is one of the most popular term-weighting schemes today; 83% of text-based recommender systems in digital libraries use tf–idf.[2]

### Creating a transformation

The transformations are standard Python objects, typically initialized by means of a training corpus:

In [19]:
from gensim import models
tfidf = models.TfidfModel(df['bow']) 

> From now on, tfidf is treated as a read-only object that can be used to convert any vector from the old representation (bag-of-words integer counts) to the new representation (TfIdf real-valued weights):

Let's apply this transformation to our bag-of-words column and create a new `tfidf` column:

In [20]:
df['tfidf'] = df['bow'].apply(lambda x: tfidf[x]) 
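
To make the weighting concrete, here is a hand-rolled sketch of one common tf-idf variant: raw term frequency multiplied by the base-2 logarithm of (number of documents / number of documents containing the term). Libraries differ in details such as the log base, smoothing and normalization, so the numbers produced by `TfidfModel` may not match these. The three toy "cities" and the helper `tfidf_weights` are invented for the example.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Weight each token by tf * log2(N / df); one common tf-idf variant."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter(token for doc in docs for token in set(doc))
    weighted = []
    for doc in docs:
        term_freq = Counter(doc)
        weighted.append({
            token: term_freq[token] * math.log2(n_docs / doc_freq[token])
            for token in term_freq
        })
    return weighted


# Three made-up word lists that share common words and differ in one regional word.
docs = [
    ['de', 'que', 'parce'],
    ['de', 'que', 'che'],
    ['de', 'que', 'wey'],
]
weights = tfidf_weights(docs)

for weight in weights:
    print(weight)
```

Words that appear in every document, like `de` and `que`, get a weight of zero under this formula, so only the words that set a city apart are left to drive the comparison.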

# Similarity
A common reason for these transformations is to determine the similarity between pairs of documents, or between a specific document and a set of other documents (such as a user query vs. indexed documents).

Latent Semantic Indexing, LSI (or sometimes LSA), transforms documents from either bag-of-words or (preferably) TfIdf-weighted space into a latent space of a lower dimensionality. Let's transform our tf-idf vectors into LSI vectors:

In [21]:
lsi = models.LsiModel(df['tfidf'],num_topics=200)
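
If you are curious about what "a latent space of a lower dimensionality" means in practice, here is a small NumPy sketch of the underlying idea: a truncated singular value decomposition of a term-document matrix. It is one common formulation shown on an invented 4x4 matrix, not a reproduction of how `LsiModel` is implemented.

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents); values are invented.
X = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.9, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.8, 1.0],
])

k = 2  # number of latent dimensions to keep (the notebook uses 200)

# Singular value decomposition; singular values come back sorted, largest first.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Project every document onto the top-k left singular vectors.
docs_latent = (U[:, :k].T @ X).T  # shape: (n_docs, k)


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


print(docs_latent.shape)
print(cosine(docs_latent[0], docs_latent[1]))  # two documents that share terms
print(cosine(docs_latent[0], docs_latent[2]))  # two documents with no shared terms
```

Documents that use the same terms end up pointing in almost the same direction in the reduced space, while unrelated documents stay close to orthogonal.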

To prepare for similarity queries, we need to index all the documents that we want to compare subsequent queries against:

In [22]:
from gensim import similarities
index = similarities.MatrixSimilarity(lsi[df['tfidf']]) 
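
`MatrixSimilarity` compares vectors using cosine similarity. Here is a minimal pure-Python sketch of that measure and of ranking documents by it. The two-dimensional vectors and the names `city_a`, `city_b` and `city_c` are made up for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return (name, score) pairs sorted from most to least similar."""
    scores = [(name, cosine(query, vector)) for name, vector in docs.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Invented two-dimensional "latent" vectors for three cities.
docs = {
    'city_a': [0.9, 0.1],
    'city_b': [0.1, 0.9],
    'city_c': [0.7, 0.7],
}
ranking = rank([1.0, 0.0], docs)

for name, score in ranking:
    print(name, round(score, 3))
```

The index does the same comparison for all documents at once, which is why a single lookup returns one score per city.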

# Query

Let's use the lyrics of a Colombian rap group as a query example:

In [23]:
query = 'Somos pacífico, estamos unidos Nos une la región La pinta, la raza y el don del sabor Somos pacífico, estamos unidos Nos une la región La pinta, la raza y el don del sabor  Ok! si por si acaso usted no conoce En el pacífico hay de todo para que goce Cantadores, colores, buenos sabores Y muchos santos para que adores Es toda una conexión Con un corrillo chocó, valle, cauca Y mis paisanos de nariño Todo este repertorio me produce orgullo Y si somos tantos Porque estamos tan al cucho (en la esquina) Bueno, dejemos ese punto a un lado Hay gente trabajando pero son contados Allá rastrillan, hablan jerguiados Te preguntan si no has janguiado (hanging out) Si estas queda’o Si lo has copiado, lo has vacilado Si dejaste al que está malo o te lo has rumbeado Hay mucha calentura en buenaventura Y si sos chocoano sos arrecho por cultura, ey!  Somos pacífico, estamos unidos Nos une la región La pinta, la raza y el don del sabor Somos pacífico, estamos unidos Nos une la región La pinta, la raza y el don del sabor Unidos por siempre, por la sangre, el color Y hasta por la tierra No hay quien se me pierda Con un vínculo familiar que aterra Característico en muchos de nosotros Que nos reconozcan por la mamá Y hasta por los rostros Étnicos, estilos que entre todos se ven La forma de caminar  El cabello y hasta por la piel Y dime quién me va a decir que no Escucho hablar de san pacho Mi patrono allá en quibdo, ey! 
Donde se ven un pico y juran que fue un beso Donde el manjar al desayuno es el plátano con queso Y eso que no te he hablado de buenaventura Donde se baila el currulao, salsa poco pega’o Puerto fiel al pescado Negras grandes con gran tumba’o Donde se baila aguabajo y pasillo  En el lado del río (ritmo folclórico) Con mis prietillos Somos pacífico, estamos unidos Nos une la región La pinta, la raza y el don del sabor Somos pacífico, estamos unidos Nos une la región La pinta, la raza y el don del sabor Es del pacífico, guapi, timbiquí, tumaco El bordo cauca Seguimos aquí con la herencia africana Más fuerte que antes Llevando el legado a todas partes De forma constante Expresándonos a través de lo cultural Música, artes plástica, danza en general Acento golpia’o al hablar El 1, 2,3 al bailar Después de eso seguro hay muchísimo más Este es pacífico colombiano Una raza un sector Lleno de hermanas y hermanos Con nuestra bámbara y con el caché (bendición, buen espíritu) Venga y lo ve usted mismo Pa vé como es, y eh! Piense en lo que se puede perder, y eh! Pura calentura y yenyeré, y eh!'.split()

We need to convert our query to the same vector space as our indexed corpus. First, let's transform it into a bag of words:

In [24]:
query_bow = dictionary.doc2bow(query)
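
One thing to keep in mind: splitting on whitespace leaves capital letters and punctuation attached to the tokens, and `doc2bow` silently ignores any token that is not in the dictionary. The dictionary entries we have seen are all lowercase, so a light normalization step may let more of the query match. Here is a sketch of such a step; `normalize` is a hypothetical helper, and keeping the accents assumes that the dataset's words keep them too.

```python
import re


def normalize(text):
    """Lowercase the text and keep only runs of letters, digits and underscores."""
    return re.findall(r'\w+', text.lower())


tokens = normalize('¡Hola, Cali! ¿Qué más?')
print(tokens)
```

The result could then be passed to the dictionary in place of the whitespace-split list, for example `dictionary.doc2bow(normalize(raw_text))`, where `raw_text` is the unsplit query string.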

In [25]:
query_bow

[(0, 4),
 (149, 6),
 (190, 1),
 (263, 1),
 (301, 1),
 (417, 2),
 (419, 1),
 (482, 1),
 (579, 1),
 (589, 1),
 (655, 2),
 (692, 1),
 (934, 1),
 (964, 1),
 (994, 4),
 (1007, 1),
 (1181, 1),
 (1196, 9),
 (1210, 1),
 (1239, 1),
 (1247, 8),
 (1276, 1),
 (1346, 1),
 (1394, 6),
 (1467, 14),
 (1503, 5),
 (1575, 1),
 (1597, 2),
 (1629, 1),
 (1631, 2),
 (1663, 7),
 (1673, 1),
 (1674, 1),
 (1701, 1),
 (1773, 1),
 (1817, 1),
 (1850, 2),
 (1884, 1),
 (1893, 1),
 (1949, 1),
 (1954, 1),
 (2000, 1),
 (2002, 1),
 (2060, 1),
 (2062, 2),
 (2111, 4),
 (2112, 3),
 (2113, 3),
 (2118, 1),
 (2128, 1),
 (2132, 1),
 (2485, 18),
 (2494, 2),
 (2653, 6),
 (2678, 1),
 (2746, 1),
 (2753, 1),
 (2854, 3),
 (2975, 2),
 (2978, 1),
 (3033, 1),
 (3039, 2),
 (3040, 1),
 (3071, 1),
 (3153, 4),
 (3181, 1),
 (3184, 1),
 (3203, 1),
 (3217, 1),
 (3272, 1),
 (3300, 2),
 (3341, 2),
 (3370, 1),
 (3528, 1),
 (3543, 1),
 (3560, 1),
 (3572, 1),
 (3578, 1),
 (3644, 1),
 (3683, 8),
 (3713, 1),
 (3804, 1),
 (3817, 1),
 (3838, 10),
 (3876

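Now, let's transform it to LSI. Keep in mind that we built the index from tf-idf vectors, so to land in exactly the same space a query would normally go through `tfidf` first (`lsi[tfidf[query_bow]]`); here we pass the raw counts straight to the LSI model: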

In [26]:
query_lsi = lsi[query_bow] 

Let's find out the similarity scores of this query against all our cities in the corpus:

In [27]:
sim_scores = index[query_lsi]

In [28]:
df['sim_score'] = sim_scores

In [29]:
# Drop rows that contain missing values in any column
df = df.dropna()

Let's sort our dataframe by similarity score and keep the top 10 cities:

In [30]:
df = df.sort_values(by=['sim_score'], ascending=False)[:10]

Let's finally see which cities are the most similar to our query!

In [40]:
similar_cities = list(zip(df['city_ascii'],df['lat'],df['lng'],df['sim_score']))
for city in similar_cities:
    print(city)

('Cali', 3.4, -76.5, 0.5546272993087769)
('Medellin', 6.275, -75.575, 0.4947495460510254)
('Quibdo', 5.6904, -76.66, 0.4928922653198242)
('Buenaventura', 3.889934, -77.07860500000002, 0.4682047367095947)
('Tulua', 4.086446, -76.197138, 0.4558314085006714)
('Bogota', 4.5964, -74.0833, 0.4548640847206116)
('Pasto', 1.2136, -77.2811, 0.41932082176208496)
('Popayan', 2.42, -76.61, 0.4132636487483978)
('Cartagena', 10.3997, -75.5144, 0.4044770300388336)
('Monteria', 8.7575, -75.89, 0.4038309156894684)


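This is nice! All ten of the most similar cities are in Colombia, so our similarity model correctly detected that our query (a rap song from Colombia) is indeed from Colombia! The list even includes Quibdo and Buenaventura, two cities named in the lyrics. For each city, we also get its similarity score.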

Finally, let's see the results on our map:

In [41]:
m = folium.Map(tiles='cartodbpositron')

for city in similar_cities:
    folium.Marker(
        location=[city[1], city[2]],
        popup=city[0]
    ).add_to(m)

m.fit_bounds(m.get_bounds())

m