# Database: data cleaning and exploration 🕵🏻

This Jupyter notebook covers the full process of cleaning and selecting the data from the Kaggle dataset ["UN General Debates"](https://www.kaggle.com/unitednations/un-general-debates), which is later used in the API.

## Index 📎

- Importing libraries and functions
- Importing the dataset
- Exploring the dataset
- Exporting the final dataset

## 1. Importing libraries and functions 📚

In [1]:
import pandas as pd
import numpy as np
import re

## 2. Importing the dataset 📖

In [2]:
data = pd.read_csv("Data/un-general-debates.csv")
data.head()

Unnamed: 0,session,year,country,text
0,44,1989,MDV,﻿It is indeed a pleasure for me and the member...
1,44,1989,FIN,"﻿\nMay I begin by congratulating you. Sir, on ..."
2,44,1989,NER,"﻿\nMr. President, it is a particular pleasure ..."
3,44,1989,URY,﻿\nDuring the debate at the fortieth session o...
4,44,1989,ZWE,﻿I should like at the outset to express my del...


## 3. Exploring the dataset 🔎

We look at all the columns to identify what information the dataset provides and to select the information we want our API to contain.

In [3]:
data.columns

Index(['session', 'year', 'country', 'text'], dtype='object')

We check for missing values (NaN):

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7507 entries, 0 to 7506
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   session  7507 non-null   int64 
 1   year     7507 non-null   int64 
 2   country  7507 non-null   object
 3   text     7507 non-null   object
dtypes: int64(2), object(2)
memory usage: 234.7+ KB
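
The non-null counts above answer the question indirectly. As a minimal sketch of a more direct check, the snippet below counts missing values per column with `isna().sum()`. The `sample` frame is a hypothetical miniature of the dataset, with one missing speech added on purpose so that the count is not zero.

```python
import pandas as pd

# Hypothetical miniature of the dataset; the None value is added on purpose
sample = pd.DataFrame(
    {
        "session": [44, 44, 44],
        "year": [1989, 1989, 1989],
        "country": ["MDV", "FIN", "NER"],
        "text": ["It is indeed a pleasure", None, "Mr. President"],
    }
)

# Number of missing values in each column
nan_counts = sample.isna().sum()
print(nan_counts)

# True if at least one cell in the whole frame is missing
has_missing = bool(sample.isna().any().any())
print(has_missing)
```

On the real `data` frame, the same call would be expected to report zero for every column, in line with the `info()` output.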


Here we confirm that there are no NaNs: every column has 7507 non-null values out of 7507 entries. Finally, we drop the first column, "session", which we do not consider relevant to the content of our API.

In [5]:
data2 = data.drop("session", axis=1)
data2

Unnamed: 0,year,country,text
0,1989,MDV,﻿It is indeed a pleasure for me and the member...
1,1989,FIN,"﻿\nMay I begin by congratulating you. Sir, on ..."
2,1989,NER,"﻿\nMr. President, it is a particular pleasure ..."
3,1989,URY,﻿\nDuring the debate at the fortieth session o...
4,1989,ZWE,﻿I should like at the outset to express my del...
...,...,...,...
7502,2001,KAZ,﻿This session\nthat is taking place under extr...
7503,2001,LBR,﻿I am honoured to\nparticipate in this histori...
7504,2001,BDI,﻿It\nis for me a signal honour to take the flo...
7505,2001,HUN,"﻿First, may I congratulate Mr. Han Seung-soo o..."


We rename the columns to make them easier to read and to manipulate later.

In [6]:
columnas = ["Year", "Country", "Text"]
data2.columns = columnas
data2.head()

Unnamed: 0,Year,Country,Text
0,1989,MDV,﻿It is indeed a pleasure for me and the member...
1,1989,FIN,"﻿\nMay I begin by congratulating you. Sir, on ..."
2,1989,NER,"﻿\nMr. President, it is a particular pleasure ..."
3,1989,URY,﻿\nDuring the debate at the fortieth session o...
4,1989,ZWE,﻿I should like at the outset to express my del...
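
Assigning a list to `columns` relies on the columns being in a known order. As an alternative sketch, `DataFrame.rename` with an explicit mapping ties each new name to its old one. The `sample` frame and the `new_names` dictionary are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical one-row frame with the same column names as data2 before renaming
sample = pd.DataFrame(
    {"year": [1989], "country": ["MDV"], "text": ["It is indeed a pleasure"]}
)

# Explicit old-name -> new-name mapping, independent of column order
new_names = {"year": "Year", "country": "Country", "text": "Text"}

# rename returns a new DataFrame and leaves the original untouched
renamed = sample.rename(columns=new_names)
print(list(renamed.columns))
```

Either approach gives the same result here; the mapping is simply harder to get wrong if the column order changes.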


We clean the text a little, storing the result in a new Speech column, to remove unnecessary whitespace and symbols and to unify the style and display of the content.

In [7]:
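# Convert all the text to lowercase
data2["Speech"] = data2["Text"].str.lower()

def clean(x):
    # Remove anything that looks like an HTML tag
    cleaned = re.sub(r"(?s)<.*?>", " ", x)
    # Remove literal \uXXXX escape sequences; this has to run before the
    # next step, which replaces backslashes with spaces
    cleaned = re.sub(r"\\u(.){4}", " ", cleaned)
    # Replace every character outside the allowed set with a space
    cleaned = re.sub(r"[^A-Za-z0-9(),*!?\'\`]", " ", cleaned)
    return cleaned.strip()

# Apply the cleaning function to the text
data2["Speech"] = data2["Speech"].apply(clean)
#data2.head()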

In [8]:
data3 = data2.drop("Text", axis=1)
data3.head()

Unnamed: 0,Year,Country,Speech
0,1989,MDV,it is indeed a pleasure for me and the members...
1,1989,FIN,"may i begin by congratulating you sir, on you..."
2,1989,NER,"mr president, it is a particular pleasure for..."
3,1989,URY,during the debate at the fortieth session of t...
4,1989,ZWE,i should like at the outset to express my dele...
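
To make the effect of the cleaning step easier to see, here is a minimal sketch that runs a similar pipeline on a single string. `clean_sketch` is a hypothetical variant, not the notebook's function: it lowercases inside the function and also collapses repeated whitespace, which the notebook does not do. The example string is invented and imitates the byte-order mark and newline that open many speeches.

```python
import re

def clean_sketch(text):
    """Hypothetical variant of the cleaning step, for illustration only."""
    text = text.lower()
    # Drop anything that looks like an HTML/XML tag
    text = re.sub(r"<[^>]*>", " ", text)
    # Drop literal escape sequences such as \u00e9 while the backslash is still there
    text = re.sub(r"\\u[0-9a-f]{4}", " ", text)
    # Keep only letters, digits and a small set of punctuation marks
    text = re.sub(r"[^a-z0-9(),*!?'`]", " ", text)
    # Collapse the runs of spaces left behind by the substitutions
    return re.sub(r"\s+", " ", text).strip()

# Invented example: leading byte-order mark, newline, a period and an HTML tag
example = "\ufeff\nMr. President, it is a <b>particular</b> pleasure!"
print(clean_sketch(example))
# mr president, it is a particular pleasure!
```

Note that periods are not in the allowed set, so sentence boundaries are lost. That matches the output above ("mr president"), but it may matter if the API needs sentence-level text.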


## 4. Exporting the final dataset 🚀

In [9]:
data3.to_json("Data/un_to_json", orient = "records")

In [10]:
data3.to_csv("Data/un_to_csv")
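
As a quick sanity check on the export, the sketch below writes a small frame to JSON and CSV in a temporary directory and reads both back. The `sample` frame and the file names are hypothetical. It also passes `index=False` to `to_csv`, which is an assumption about what is wanted: with the default `index=True`, pandas writes the row index as an extra column, which shows up as `Unnamed: 0` when the file is read again.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical two-row frame with the same columns as data3
sample = pd.DataFrame(
    {
        "Year": [1989, 1989],
        "Country": ["MDV", "FIN"],
        "Speech": ["it is indeed a pleasure", "may i begin by congratulating you"],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "sample.json"
    csv_path = Path(tmp) / "sample.csv"

    # One JSON object per row, as in the notebook
    sample.to_json(json_path, orient="records")
    # index=False keeps the row index out of the CSV file
    sample.to_csv(csv_path, index=False)

    # Read both files back while the temporary directory still exists
    from_json = pd.read_json(json_path, orient="records")
    from_csv = pd.read_csv(csv_path)

print(list(from_json.columns))
print(list(from_csv.columns))
```

If the API reads the CSV file directly, dropping the index this way avoids having to discard the extra column on load.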