# La Mancha Project: Get to Know Your Region

This is the final project for the course "Programming Techniques for NLP". In this notebook we explore data on every person born in the province of Ciudad Real (Spain), from any historical period, who has an entry in Wikidata.

First of all, let us import all the necessary libraries for this project.

In [1]:
import pandas as pd
from lamancha_utils import get_zodiac_sign, get_stats


In [2]:
# Display all rows and remove the column width limit
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

In [48]:
df = pd.read_csv("C:\\Users\\usuario\\Desktop\\ehu_python\\5_WikiData\\lamancha_project\\lamancha.csv")

Let's have a look at the last five entries of our dataset.

In [18]:
df.tail(5)

Unnamed: 0,person,genderLabel,personLabel,birthDate,placeBirthLabel,description
501,http://www.wikidata.org/entity/Q17413239,masculino,Evaristo Martín Freire,1904-06-29T00:00:00Z,Piedrabuena,farmacéutico español
502,http://www.wikidata.org/entity/Q17631650,masculino,Tomás Aránguez,1942-01-01T00:00:00Z,Brazatortas,empresario español
503,http://www.wikidata.org/entity/Q18812414,masculino,Antonio Manuel Sarmiento García,1979-01-27T00:00:00Z,Villarrubia de los Ojos,futbolista español
504,http://www.wikidata.org/entity/Q18812414,masculino,Antonio Manuel Sarmiento García,1979-07-27T00:00:00Z,Villarrubia de los Ojos,futbolista español
505,http://www.wikidata.org/entity/Q20005385,masculino,Manuel Adriano Arabid Cantos,1908-09-08T00:00:00Z,Herencia,político español


In [5]:
df.shape

(506, 6)

In this first peek we see six columns and 506 rows. The dataset also contains duplicates: among these last five entries, one person (Antonio Manuel Sarmiento García) appears twice with two different birth dates. Let us explore the duplicates in our dataset:

In [40]:
duplicate_names = df[df.duplicated(subset=['personLabel'], keep=False)]
duplicate_names

Unnamed: 0,person,genderLabel,personLabel,birthDate,placeBirthLabel,description
9,"http://www.wikidata.org/entity/Q32442,masculino,Ángel Ayala,1867-03-01T00:00:00Z,Ciudad Real,""clérigo jesuita, pedagogo y propagandista católico español""",,,,,
13,"http://www.wikidata.org/entity/Q1690876,masculino,Joaquín Araujo Ruano,1851-01-01T00:00:00Z,Ciudad Real,""pintor, grabador e ilustrador español""",,,,,
31,"http://www.wikidata.org/entity/Q5864996,masculino,Francisco Aguilera y Egea,1857-03-31T00:00:00Z,Ciudad Real,""Ciudad Real, 21.XII.1857 – Madrid, 20.V.1931. Capitán general del Ejército, ministro de la Guerra, senador vitalicio""",,,,,
40,"http://www.wikidata.org/entity/Q5495690,masculino,Diego Medrano y Treviño,1784-11-13T00:00:00Z,Ciudad Real,""Militar, político y ensayista español (1784-1853)""",,,,,
67,http://www.wikidata.org/entity/Q6173045,masculino,Ángel Andrade,1866-03-15T00:00:00Z,Ciudad Real,pintor español
68,http://www.wikidata.org/entity/Q6173045,masculino,Ángel Andrade,1866-05-15T00:00:00Z,Ciudad Real,pintor español
91,"http://www.wikidata.org/entity/Q28124043,masculino,José Núñez de Arenas,1784-07-03T00:00:00Z,Ciudad Real,""político, militar y matemático español""",,,,,
93,http://www.wikidata.org/entity/Q793631,masculino,Bernardo de Balbuena,1568-01-01T00:00:00Z,Valdepeñas,Poeta y eclesiástico novohispano
94,http://www.wikidata.org/entity/Q793631,masculino,Bernardo de Balbuena,1568-11-30T00:00:00Z,Valdepeñas,Poeta y eclesiástico novohispano
102,http://www.wikidata.org/entity/Q9003232,masculino,Hernán Pérez del Pulgar,1415-01-01T00:00:00Z,Ciudad Real,militar español


In [7]:
duplicate_names.shape 

(68, 6)

This gives us an important insight into our dataset. There are two different kinds of duplicates:
* Duplicates caused by different birth dates assigned to the same person
* Duplicates caused by malformed rows that leave NaN values in most columns. This happens when the entry's description contains commas.
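The malformed rows could be repaired rather than discarded. The sketch below assumes that, in such a row, the whole original record ended up in the `person` column and every other column is NaN, as the output above suggests. The helper name `repair_collapsed_rows` and the `COLUMNS` constant are hypothetical, and the two demo rows are copied from the outputs shown in this notebook.

```python
import csv

import pandas as pd

COLUMNS = ["person", "genderLabel", "personLabel", "birthDate",
           "placeBirthLabel", "description"]


def repair_collapsed_rows(frame: pd.DataFrame) -> pd.DataFrame:
    """Re-split rows whose six fields were all read into the first column."""
    repaired = frame.copy()
    # A row counts as collapsed when every column except the first is NaN.
    collapsed = repaired[COLUMNS[1:]].isna().all(axis=1)
    for idx in repaired.index[collapsed]:
        raw = repaired.at[idx, "person"]
        if not isinstance(raw, str):
            continue  # nothing to split
        # csv.reader honours the quotes around the comma-separated description.
        fields = next(csv.reader([raw]))
        if len(fields) == len(COLUMNS):  # leave unexpected shapes untouched
            repaired.loc[idx, COLUMNS] = fields
    return repaired


demo = pd.DataFrame(
    [
        ["http://www.wikidata.org/entity/Q6173045", "masculino", "Ángel Andrade",
         "1866-03-15T00:00:00Z", "Ciudad Real", "pintor español"],
        ['http://www.wikidata.org/entity/Q32442,masculino,Ángel Ayala,'
         '1867-03-01T00:00:00Z,Ciudad Real,'
         '"clérigo jesuita, pedagogo y propagandista católico español"',
         None, None, None, None, None],
    ],
    columns=COLUMNS,
)

fixed = repair_collapsed_rows(demo)
print(fixed[["personLabel", "description"]])
```

Rows that do not split into exactly six fields are left as they are, so they can still be inspected by hand.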

Here we deal with the "personLabel" duplicates. Some entries appear twice with a slightly different description or with two different birth dates. Because this project does not aim to establish which value is correct, we simply keep the first occurrence of each name and drop the later ones.

In [8]:
df.drop_duplicates(subset=['personLabel'], keep="first", inplace=True)
df.shape

(453, 6)
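One detail is worth checking here. pandas treats missing values as equal to each other when it looks for duplicates, so rows whose `personLabel` is NaN are flagged as duplicates of one another, and `drop_duplicates` keeps only the first of them. The sketch below shows this on a small illustrative frame (not the real `lamancha.csv`) and one possible way to deduplicate only the rows that have a name.

```python
import pandas as pd

# Illustrative data: two rows for the same person and two rows without a name.
toy = pd.DataFrame({
    "personLabel": ["Ángel Andrade", "Ángel Andrade", None, None, "Tomás Aránguez"],
    "birthDate": ["1866-03-15", "1866-05-15", None, None, "1942-01-01"],
})

# Missing labels are considered duplicates of each other.
flagged = toy.duplicated(subset=["personLabel"], keep=False)

# Deduplicate only the named rows; keep the unnamed ones aside for inspection.
named = toy[toy["personLabel"].notna()].drop_duplicates(
    subset=["personLabel"], keep="first"
)
unnamed = toy[toy["personLabel"].isna()]

print(flagged.tolist())
print(len(named), len(unnamed))
```

Keeping the unnamed rows in a separate frame means they are not silently discarded along with the genuine name duplicates.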

In [10]:
print(f"Total rows: {df.shape[0]}")
print(f"Total columns: {df.shape[1]}")
print(f"Column names: {', '.join(df.columns)}")
print("Gender:")
gender_counts = df['genderLabel'].value_counts()
for gender, count in gender_counts.items():
    print(f"{gender}: {count}", end=". ")

Total rows: 453
Total columns: 6
Column names: person, genderLabel, personLabel, birthDate, placeBirthLabel, description
Gender:
masculino: 385. femenino: 63. mujer transgénero: 1. 
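The three counts add up to 449, four fewer than the 453 rows. A likely explanation is that `value_counts` skips missing values by default, so rows with a NaN `genderLabel` are not counted. The sketch below uses a small illustrative Series (not the real data) to show the difference that `dropna=False` makes.

```python
import pandas as pd

# Illustrative values only; one entry has no gender recorded.
genders = pd.Series(["masculino", "femenino", None, "masculino"], name="genderLabel")

counts_default = genders.value_counts()               # NaN excluded
counts_with_nan = genders.value_counts(dropna=False)  # NaN gets its own row

print(counts_default.sum())
print(counts_with_nan.sum())
```

With `dropna=False` the counts sum to the number of rows, which makes missing values visible in the summary.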

In [11]:
import numpy as np


def get_stats(dataset: list[list[tuple]]) -> None:
    """Print length statistics for a dataset of tokenized sentences."""
    # Note: this local definition shadows the get_stats imported from lamancha_utils.
    sentence_lengths = [len(sentence) for sentence in dataset]

    print(f"Total entries: {len(dataset)}")
    print(f"Average sentence length: {round(np.mean(sentence_lengths))}")
    print(f"Minimum sentence length: {min(sentence_lengths)}")
    print(f"Maximum sentence length: {max(sentence_lengths)}")
    print(f"Percentile 25, length: {np.percentile(sentence_lengths, 25)}")
    print(f"Percentile 50, length: {np.percentile(sentence_lengths, 50)}")
    print(f"Percentile 75, length: {np.percentile(sentence_lengths, 75)}")

In [12]:
df["birthDate"]

0      1875-09-06T00:00:00Z
1      1991-06-22T00:00:00Z
2      1928-02-01T00:00:00Z
3                       NaN
4      1960-06-21T00:00:00Z
5      1959-01-01T00:00:00Z
6      1952-05-15T00:00:00Z
7      1981-08-14T00:00:00Z
8      1954-05-11T00:00:00Z
10     1877-01-01T00:00:00Z
11     1949-10-21T00:00:00Z
12     1470-01-01T00:00:00Z
14     1980-04-27T00:00:00Z
15     1965-01-01T00:00:00Z
16     1987-10-14T00:00:00Z
17     1958-04-24T00:00:00Z
18     1490-01-01T00:00:00Z
19     1980-07-07T00:00:00Z
20     1926-07-18T00:00:00Z
21     1979-10-25T00:00:00Z
22     1959-03-01T00:00:00Z
23     1970-02-14T00:00:00Z
24     1492-01-01T00:00:00Z
25     1986-01-14T00:00:00Z
26     1486-01-01T00:00:00Z
27     1485-01-01T00:00:00Z
28     1953-03-15T00:00:00Z
29     1977-07-30T00:00:00Z
30     1927-01-01T00:00:00Z
32     1968-07-19T00:00:00Z
33     1958-01-01T00:00:00Z
34     1951-01-01T00:00:00Z
35     1979-09-21T00:00:00Z
36                      NaN
37     1958-07-22T00:00:00Z
38     1551-01-01T00

In [13]:
df["signos"] = df["birthDate"].apply(get_zodiac_sign)
sign_counts = df.value_counts("signos")
sign_counts

ValueError: too many values to unpack (expected 3)
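The traceback does not show which value triggered the error, and the implementation of `get_zodiac_sign` is not included in this notebook. An error like this would be consistent with a helper that unpacks a split date string into exactly three parts and then meets a value of a different shape, but that is only a guess. The sketch below is a defensive alternative under a hypothetical name, `zodiac_sign_from_iso`. It reads month and day with a regular expression instead of `pd.to_datetime`, because years such as 1470 fall outside the range of pandas' default nanosecond timestamps. The sign boundaries follow one common convention and can differ by a day between sources.

```python
import re
from typing import Optional

import pandas as pd

# Entry i holds (day on which a sign starts in month i + 1, sign name).
SIGN_STARTS = [
    (20, "Aquarius"), (19, "Pisces"), (21, "Aries"), (20, "Taurus"),
    (21, "Gemini"), (21, "Cancer"), (23, "Leo"), (23, "Virgo"),
    (23, "Libra"), (23, "Scorpio"), (22, "Sagittarius"), (22, "Capricorn"),
]

# Matches strings such as "1875-09-06T00:00:00Z"; captures month and day.
ISO_DATE = re.compile(r"^-?\d{4,}-(\d{2})-(\d{2})T")


def zodiac_sign_from_iso(value) -> Optional[str]:
    """Return the zodiac sign for an ISO-like date string, or None."""
    if not isinstance(value, str):
        return None  # NaN and other non-string values
    match = ISO_DATE.match(value)
    if match is None:
        return None  # not a date, e.g. a collapsed row or a URL
    month, day = int(match.group(1)), int(match.group(2))
    if not (1 <= month <= 12 and 1 <= day <= 31):
        return None
    start_day, sign = SIGN_STARTS[month - 1]
    if day >= start_day:
        return sign
    # Before the cutoff the date belongs to the sign that started the
    # previous month (index -1 wraps around to Capricorn for January).
    return SIGN_STARTS[month - 2][1]


births = pd.Series(["1875-09-06T00:00:00Z", "1470-01-01T00:00:00Z", None])
signs = births.map(zodiac_sign_from_iso)
print(signs.value_counts())
```

Returning `None` for anything unparseable keeps the `apply` or `map` call from failing on a single bad value, and the affected rows can then be located with `isna()`.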