# AUTOMATING THE CREATION OF FICTITIOUS DATA

Faker is a Python library that generates fake data resembling real-world data. It can produce names, addresses, email addresses, financial data, text, and other kinds of information quickly and easily. This is especially useful in application development, software testing, and data analysis, where large volumes of fictitious data are needed to run tests without using real data.

### Installation

In [1]:
%pip install faker

Then, in your code, create a Faker instance. If you do not specify a locale, Faker uses English (en_US) by default, but you can switch to the region you prefer (for example, es_ES for Spanish as used in Spain).

In [2]:
from faker import Faker
fake = Faker('es_ES')  # Data in Spanish

### Main Data Providers

Faker uses "providers" to generate specific kinds of data. Some of the most important are:

* Personal: name(), address(), email(), phone_number().
* Temporal: date(), time(), date_of_birth().
* Geographic: country(), city(), address(), postcode().
* Textual: sentence(), paragraph(), text().

In [3]:
fake = Faker('es_ES')

print(fake.name())  # Random name
print(fake.address())  # Random address
print(fake.email())  # Random email address
print(fake.phone_number())  # Random phone number

Alba Franco
Camino Celestina Tovar 5 Apt. 31 
Granada, 09691
ngodoy@example.org
+34 981 416 581


Faker relies on a large number of internal lists and rules to generate realistic-looking data. For example, when you call fake.name(), Faker does not just pick a name from a list: it chooses one of several predefined formats, such as <First name> <Last name>, and fills it in, which adds variability.

In [4]:
fake = Faker()  # No locale given, so the default (en_US) is used

# Generate a full name that follows one of the locale's formats
nombre = fake.name()
print(nombre)  # Example: "Kelli Forbes"

# Generate a full address
direccion = fake.address()
print(direccion)  # Example: "6290 Riley Port", then "Cherylville, LA 73357" on a second line

Kelli Forbes
6290 Riley Port
Cherylville, LA 73357
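
The format-based approach described above can be illustrated with a minimal sketch. This is not Faker's actual implementation: the word lists, the `FORMATS` tuple and the `fake_name` helper are hypothetical and only show how choosing a pattern first and filling its placeholders afterwards adds variability.

```python
import random

# Hypothetical word lists; Faker ships far larger, locale-specific ones
FIRST_NAMES = ["Alba", "Pedro", "Lucía"]
LAST_NAMES = ["Franco", "García", "Tovar"]
PREFIXES = ["Dr.", "Sra."]

# Each format is a template whose placeholders are filled independently
FORMATS = (
    "{first} {last}",
    "{first} {last} {last2}",
    "{prefix} {first} {last}",
)

def fake_name(rng):
    """Pick a format at random, then fill its placeholders with random elements."""
    template = rng.choice(FORMATS)
    # str.format ignores keyword arguments that the template does not use
    return template.format(
        first=rng.choice(FIRST_NAMES),
        last=rng.choice(LAST_NAMES),
        last2=rng.choice(LAST_NAMES),
        prefix=rng.choice(PREFIXES),
    )

rng = random.Random(0)  # Seeded so repeated runs print the same names
for _ in range(3):
    print(fake_name(rng))
```

With only three formats and a handful of words the output already varies in both length and structure, which is the effect the paragraph attributes to Faker's much larger rule set.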


### Advanced Methods and Customization

1. Custom providers: You can add your own data provider. This is useful when you need data that Faker does not provide out of the box.

In [5]:
from faker.providers import BaseProvider

class MyProvider(BaseProvider):
    def nombre_customizado(self):
        nombres = ['Nombre1', 'Nombre2', 'Nombre3']
        return self.random_element(nombres)

fake.add_provider(MyProvider)
print(fake.nombre_customizado())  # Result: Nombre1, Nombre2 or Nombre3

Nombre2


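2. Unique data: If you do not want Faker to repeat values, use unique. It keeps track of the values already returned by a given Faker instance and guarantees that none is returned twice; if no unused value can be found, Faker raises a UniquenessException.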

In [6]:
print(fake.unique.name())  # Returns a name this instance has not returned before

Lisa Brown
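
The idea behind unique values can be sketched with the standard library alone. The `UniqueGenerator` class below is a hypothetical, simplified illustration (remember what was returned, retry on repeats, give up after a number of attempts); Faker's own implementation differs in its details.

```python
import random

class UniqueGenerator:
    """Hypothetical wrapper that never returns the same value twice."""

    def __init__(self, generator, max_attempts=1000):
        self.generator = generator        # Zero-argument callable producing candidates
        self.max_attempts = max_attempts  # Give up after this many consecutive repeats
        self.seen = set()

    def __call__(self):
        for _ in range(self.max_attempts):
            value = self.generator()
            if value not in self.seen:
                self.seen.add(value)
                return value
        raise RuntimeError("no unused values left")

rng = random.Random(0)
unique_digit = UniqueGenerator(lambda: rng.randint(0, 9))

digits = [unique_digit() for _ in range(10)]
print(sorted(digits))  # All ten digits, each exactly once

try:
    unique_digit()  # An eleventh distinct digit does not exist
except RuntimeError as error:
    print(error)
```

The sketch makes the trade-off visible: uniqueness is only possible while the underlying generator still has unused values, so small value spaces are exhausted quickly.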


### Generating and Exporting Datasets

You can generate fake datasets for tests or demos and then export them to formats such as CSV or JSON. This is useful for producing large volumes of data that you can load into databases or use in simulations.

In [7]:
import pandas as pd

fake = Faker()

# Generate 100 fake records
data = {
    'Nombre': [fake.name() for _ in range(100)],
    'Correo': [fake.email() for _ in range(100)],
    'Fecha de Nacimiento': [fake.date_of_birth() for _ in range(100)],
    'Teléfono': [fake.phone_number() for _ in range(100)]
}

# Create a pandas DataFrame
df = pd.DataFrame(data)

# Export to CSV
df.to_csv('datos_falsos.csv', index=False)
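
Since the text also mentions JSON, here is a minimal sketch of a JSON export using only the standard library. The two records are hand-written placeholders standing in for Faker output, and the file name `datos_falsos.json` is an assumption chosen to mirror the CSV example; the file is written to a temporary directory so the sketch leaves nothing behind.

```python
import json
import tempfile
from datetime import date
from pathlib import Path

# Hypothetical records standing in for Faker output
records = [
    {"Nombre": "Alba Franco", "Correo": "ngodoy@example.org", "Fecha de Nacimiento": date(1990, 5, 17)},
    {"Nombre": "Pedro García", "Correo": "pgarcia@example.org", "Fecha de Nacimiento": date(1985, 11, 2)},
]

# date objects are not JSON serializable, so default=str renders them as ISO strings;
# ensure_ascii=False keeps accented characters readable in the file
json_text = json.dumps(records, default=str, ensure_ascii=False, indent=2)

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "datos_falsos.json"
    path.write_text(json_text, encoding="utf-8")
    # Reading the file back yields plain strings for the dates
    loaded = json.loads(path.read_text(encoding="utf-8"))

print(loaded[0]["Fecha de Nacimiento"])  # 1990-05-17
```

Dates come back as strings after the round trip, so any consumer of the file has to parse them again if it needs date arithmetic.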

# FAKER DATASET EXAMPLE

In [10]:
# Faker generates fake data automatically from within an application
%pip install Faker
# PyMySQL provides access to MySQL databases
%pip install PyMySQL

Note: you may need to restart the kernel to use updated packages.

In [11]:
import codecs
from datetime import date
from datetime import datetime
from faker import Faker
import pymysql
import random
import requests
import csv
import pandas as pd
from sqlalchemy import create_engine

## Initializing MySQL with test data

Constants used while building the dataset:

In [12]:
NUMERO_CLIENTES = 500
NUMERO_PROVEEDORES = 10
SEMILLA_ALEATORIA_GENERADOR = 10
SEMILLA_ALEATORIA_RANDOM = 1

We initialize the fake-content generator with the Spanish locale and set the random seeds so that the generated dataset is always the same:

In [13]:
Faker.seed(SEMILLA_ALEATORIA_GENERADOR)
random.seed(SEMILLA_ALEATORIA_RANDOM)
fake = Faker(['es_ES'])
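
The effect of seeding can be checked with a small sketch based on the standard `random` module. The `sample_ids` helper is hypothetical. Faker keeps its own random generator, separate from the module-level functions of `random`, which is why the cell above seeds both.

```python
import random

def sample_ids(seed, count=5):
    """Return `count` pseudo-random integers produced from the given seed."""
    rng = random.Random(seed)  # Independent generator, so global state is untouched
    return [rng.randint(1, 500) for _ in range(count)]

first_run = sample_ids(seed=1)
second_run = sample_ids(seed=1)
other_seed = sample_ids(seed=2)

print(first_run == second_run)  # Same seed, same sequence
print(first_run == other_seed)  # A different seed almost certainly diverges
```

Reproducibility also depends on calling the generators in the same order: adding, removing or reordering calls between the seed and a given value changes that value.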

Functions that generate the dataset:

In [14]:
def build_providers_dataset(number):
  providers = []
  for i in range(1, number+1):
    providers.append({
      "provider_id": i,
      "name": fake.company(),
      "email": fake.company_email(),
      "webpage": fake.domain_name()
    })

  return {
      "providers": providers
  }


In [15]:
def build_products_dataset(providers_info):
  products = []
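  url = 'https://drive.google.com/uc?export=view&id=1D9MY0au4b7SXwhUdm6TNfsKfYzdkbAh_'
  # Download the product list (CSV) and fail early on an HTTP error
  content = requests.get(url)
  content.raise_for_status()
  # Decode the response line by line and parse each row into a dict
  text = codecs.iterdecode(content.iter_lines(), 'utf-8')
  reader = csv.DictReader(text, delimiter=',', quotechar='"')
  for row in reader:
    products.append(row)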

  categories = sorted(set([product['category'] for product in products]))
  categories = [{"category_id": i+1, "name": category} for (i, category) in enumerate(categories)]
  categories_by_name = {category["name"]: category["category_id"] for category in categories}
  products = [{"product_id": i+1, 
              "name": product["name"], 
              "price": float(product["price"]), 
              "category_id": categories_by_name[product["category"]],
              "provider_id": random.choice(providers_info)["provider_id"]} 
              for (i, product) in enumerate(products)]
  return {
      'products': products,
      'categories': categories
  }
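
The category handling in this function can be tried offline with a minimal sketch. The inline CSV below is hypothetical sample content with the three columns the function reads (`name`, `price`, `category`); the real file is downloaded from the URL above.

```python
import csv
import io

# Hypothetical CSV content with the columns the function expects
csv_text = """name,price,category
Teclado,19.99,Periféricos
Monitor,149.50,Pantallas
Ratón,9.95,Periféricos
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Sorting the distinct names makes the id assignment deterministic
category_names = sorted({row["category"] for row in rows})
category_ids = {name: index for index, name in enumerate(category_names, start=1)}

products = [
    {
        "product_id": index,
        "name": row["name"],
        "price": float(row["price"]),  # DictReader returns every field as a string
        "category_id": category_ids[row["category"]],
    }
    for index, row in enumerate(rows, start=1)
]

print(category_ids)
print(products[0])
```

Deriving the ids from the sorted set of names means the same CSV always yields the same `category_id` values, independently of the row order.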

In [16]:
def build_people_dataset(number):

  people = []
  addresses = []
  payment_info = []
  address_id = 0
  payment_id = 0

  for i in range(1, number+1):
    # Person data
    people.append({
      "person_id": i,
      "first_name": fake.first_name(),
      "last_name": fake.last_name(),
      "birth_date": fake.date_between_dates(datetime(1960, 1, 1), datetime(2002, 6, 1)),
      "email": fake.email(),
      "phone": fake.phone_number(),
      "username": fake.user_name(),
      "password": fake.sha256(),
      "job": fake.job()
    })

    # Payment information
    if random.choice([False]*1 + [True]*2):
      payment_id += 1
      payment_info.append({
          "payment_id": payment_id,
          "person_id": i,
          "expiration": fake.credit_card_expire(),
          "number": fake.credit_card_number(),
          "provider": fake.credit_card_provider(),
          "security_code": fake.credit_card_security_code()
      })

    # Registered addresses
    for j in range(random.choice([1]*43 + [2]*6 + [3])):
      address_id+=1
      addresses.append(
      {
        "address_id": address_id,
        "person_id": i,
        "city": fake.city(),
        "number": fake.building_number(),
        "country": "España",
        "zipcode": fake.postcode(),
        "street": fake.street_name()
      })

  return {
      "people": people,
      "addresses": addresses,
      "payment_information": payment_info,
  }
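
The function above encodes probabilities by repeating elements in a list, for example `[1]*43 + [2]*6 + [3]`. As a sketch of an alternative, the same distribution can be written with explicit weights through `random.choices`; the two helper functions are hypothetical and exist only for the comparison.

```python
import random
from collections import Counter

rng = random.Random(1)

# Repeated-element encoding used in the notebook: 43/50, 6/50 and 1/50
repeated = [1] * 43 + [2] * 6 + [3]

def address_count_repeated():
    return rng.choice(repeated)

def address_count_weighted():
    # Same distribution with explicit weights; choices returns a list
    return rng.choices([1, 2, 3], weights=[43, 6, 1], k=1)[0]

draws = 10_000
print(Counter(address_count_repeated() for _ in range(draws)))
print(Counter(address_count_weighted() for _ in range(draws)))
```

Both encodings describe the same probabilities, but they consume the random stream differently, so swapping one for the other changes the dataset produced by a given seed.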


In [17]:
def build_network_dataset(people_info):

  WEB_PAGES = [fake.uri_path() for i in range(0,100)]
  ACCESS_METHOD_PROPORTION = ['GET'] * 10 + ['POST'] 
  pages = []
  accesses = []
  access_id = 0

  for i in range(0, len(WEB_PAGES)):
    pages.append({
        "page_id": i+1,
        "path": WEB_PAGES[i]
    })

  for person in people_info:
    # Access to webpages
    for j in range(int(random.gauss(60, 40))):
      access_id += 1
      accesses.append({
          "access_id": access_id,
          "person_id": person["person_id"],
          "method": random.choice(ACCESS_METHOD_PROPORTION),
          "ip": fake.ipv4_public(),
          "date": fake.date_time_between(datetime(2020,1,1,0,0,0), datetime(2020,9,1,23,59,59)),
          # randint includes both endpoints, so every page_id can be drawn
          "page_id": random.randint(1, len(WEB_PAGES))
      })

  # Anonymous access
  for i in range(int(random.gauss(1000, 100))):
    access_id += 1
    accesses.append({
        "access_id": access_id,
        "person_id": None,
        "method": random.choice(ACCESS_METHOD_PROPORTION),
        "ip": fake.ipv4_public(),
        "date": fake.date_time_between(datetime(2020,1,1,0,0,0), datetime(2020,9,1,23,59,59)),
        # randint includes both endpoints, so every page_id can be drawn
        "page_id": random.randint(1, len(WEB_PAGES))
    })

  return {
    "web_pages":  pages,
    "accesses": accesses
  }
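
The number of accesses per person is drawn with `int(random.gauss(60, 40))`. With a standard deviation that large, roughly 7% of the draws are negative, and `range()` of a negative number is simply empty, so those people end up with no accesses. The sketch below makes that lower bound explicit; the `access_count` helper is hypothetical.

```python
import random

rng = random.Random(1)

def access_count(mean=60, deviation=40):
    """Draw a non-negative number of accesses from a normal distribution."""
    # int() truncates toward zero; max() makes the lower bound explicit
    return max(0, int(rng.gauss(mean, deviation)))

counts = [access_count() for _ in range(10_000)]
without_accesses = sum(1 for count in counts if count == 0)

print(min(counts), max(counts))
print(f"{without_accesses / len(counts):.1%} of the simulated people have no accesses")
```

Clamping does not change the resulting dataset, since a negative range is already empty; it only documents the behaviour for the reader.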


In [18]:
def build_shopping_dataset(people, products, people_addresses):

  shopping_carts = []
  shopping_cart_products = []
  orders = []
  order_products = []
  invoices = []
  cart_id = 0
  shopping_cart_id = 0
  order_id = 0
  order_product_id = 0
  invoice_id = 0

  PRODUCTS_PROBABILITY = [1]*2 + [2] * 3 + [3] * 3 + [4]*2 + [5]
  ORDER_PROBABILITY = [0]+[1]*7+[2]*3+[3]*3+[4]*2+[5]
  QUANTITY_PROBABILITY = [1]*5 +[2]*2 +[3]
  RATING_PROBABILITY = [1]+[2]+[3]*2+[4]*4+[5]*3

  for person in people:
    # Build shopping cart
    if random.choice([False] * 9 + [True]):  # 1 in 10 people has a shopping cart
      cart_id += 1
      shopping_carts.append({
          "cart_id": cart_id,
          "person_id": person["person_id"],
          "date": fake.date_time_between(datetime(2020,1,1,0,0,0), datetime(2020,9,1,23,59,59)),
      })

      chosen = random.sample(products, k = random.choice(PRODUCTS_PROBABILITY))
      for product in chosen:
        shopping_cart_id += 1
        shopping_cart_products.append({
            "cart_id": cart_id,
            "product_id": product["product_id"],
            "quantity": random.choice(QUANTITY_PROBABILITY)
        })
    
    # Build orders
    for i in range(0, random.choice(ORDER_PROBABILITY)):
      order_id += 1
      order_price = 0
      chosen = random.sample(products, k = random.choice(PRODUCTS_PROBABILITY))
      for product in chosen:
        order_product_id += 1
        quantity = random.choice(QUANTITY_PROBABILITY)
        order_products.append({
            "order_id": order_id,
            "product_id": product["product_id"],
            "quantity": quantity
        })
        order_price += quantity * product['price']

      person_addresses = [address for address in people_addresses if address["person_id"] == person["person_id"]]
      delivery_address = random.choice(person_addresses)
      billing_address = random.choice(person_addresses)
      orders.append({
          "order_id": order_id,
          "person_id": person["person_id"],
          "date": fake.date_time_between(datetime(2020,1,1,0,0,0), datetime(2020,9,1,23,59,59)),
          # Purposely left wrong
          "delivery_address": delivery_address['address_id'],
          "billing_address": billing_address['address_id'],
          "price": order_price
      })

  # Build invoices
  for order in random.sample(orders, k = int(len(orders) * 0.8)):  # 80% of the orders, each at most once
    invoice_id += 1
    invoices.append({
      "invoice_id": invoice_id,
      "order_id": order["order_id"],
      "date": fake.date_time_between(order["date"], datetime(2020,9,1,23,59,59)),
      "rating": random.choice(RATING_PROBABILITY)
    })

  return {
      'carts': shopping_carts,
      'cart_product': shopping_cart_products,
      'orders': orders,
      'order_product': order_products,
      'invoices': invoices    
  }
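
When selecting the subset of orders that receive an invoice, it matters whether the sampling is done with or without replacement. A minimal sketch with ten hypothetical order ids shows the difference between `random.choices` and `random.sample`:

```python
import random

rng = random.Random(1)
order_ids = list(range(1, 11))  # Ten hypothetical orders
k = int(len(order_ids) * 0.8)   # Invoice 80% of them

with_replacement = rng.choices(order_ids, k=k)    # The same order may appear twice
without_replacement = rng.sample(order_ids, k=k)  # Every order appears at most once

print(len(set(with_replacement)), "distinct orders out of", k, "draws (choices)")
print(len(set(without_replacement)), "distinct orders out of", k, "draws (sample)")
```

Sampling with replacement can attach several invoices to one order and leaves more orders without any, so the share of invoiced orders tends to fall below the intended 80%.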

Incremental construction of the dataset:

In [19]:
dataset = {}
dataset.update(build_providers_dataset(NUMERO_PROVEEDORES))
dataset.update(build_products_dataset(dataset['providers']))
dataset.update(build_people_dataset(NUMERO_CLIENTES))
dataset.update(build_network_dataset(dataset['people']))
dataset.update(build_shopping_dataset(dataset['people'], dataset['products'], dataset['addresses']))

### Loading the dataset into MySQL

#### Creating the database

Script that creates the database in MySQL:

```
DROP SCHEMA IF EXISTS shop;
CREATE SCHEMA shop CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
USE shop;

DROP TABLE IF EXISTS accesses;
CREATE TABLE accesses (
    access_id INT,
    person_id INT NULL DEFAULT NULL,
    date DATETIME,
    ip VARCHAR(20),
    method VARCHAR(10),
    page_id INT,
    PRIMARY KEY(access_id)
)
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

DROP TABLE IF EXISTS web_pages;
CREATE TABLE web_pages (
    page_id INT,
    path VARCHAR(250),
    PRIMARY KEY(page_id)
)
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

DROP TABLE IF EXISTS carts;
CREATE TABLE carts (
    cart_id INT,
    person_id INT,
    date DATETIME,
    PRIMARY KEY(cart_id)
)
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

DROP TABLE IF EXISTS cart_product;
CREATE TABLE cart_product (
    cart_id INT,
    product_id INT,
    quantity INT,
    PRIMARY KEY(cart_id, product_id)
)
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

DROP TABLE IF EXISTS categories;
CREATE TABLE categories (
    category_id INT,
    name VARCHAR(100),
    PRIMARY KEY(category_id)
)
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

DROP TABLE IF EXISTS invoices;
CREATE TABLE invoices (
    invoice_id INT,
    order_id INT,
    date DATETIME,
    rating INT,
    PRIMARY KEY(invoice_id)
)
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

DROP TABLE IF EXISTS orders;
CREATE TABLE orders (
    order_id INT,
    person_id INT,
    date DATETIME,
    billing_address INT,
    delivery_address INT,
    price DECIMAL(18,6),
    PRIMARY KEY(order_id)
)
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

DROP TABLE IF EXISTS order_product;
CREATE TABLE order_product (
    order_id INT,
    product_id INT,
    quantity INT,
    PRIMARY KEY(order_id, product_id)
)
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

DROP TABLE IF EXISTS payment_information;
CREATE TABLE payment_information (
    payment_id INT,
    person_id INT,
    number VARCHAR(30),
    provider VARCHAR(200),
    security_code VARCHAR(10),
    expiration VARCHAR(5),
    PRIMARY KEY(payment_id)
)
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

DROP TABLE IF EXISTS people;
CREATE TABLE people (
    person_id INT,
    birth_date DATETIME,
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    email VARCHAR(150),
    job VARCHAR(100),
    phone VARCHAR(20),
    username VARCHAR(50),
    password VARCHAR(100),
    PRIMARY KEY(person_id)
)
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

DROP TABLE IF EXISTS addresses;
CREATE TABLE addresses (
    address_id INT,
    person_id INT,
    city VARCHAR(30),
    country VARCHAR(20),
    number INT,
    street VARCHAR(100),
    zipcode VARCHAR(10),
    PRIMARY KEY(address_id)
)
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

DROP TABLE IF EXISTS products;
CREATE TABLE products (
    product_id INT,
    category_id INT NULL DEFAULT NULL,
    provider_id INT NULL DEFAULT NULL,
    name VARCHAR(200),
    price DECIMAL(10,4),
    PRIMARY KEY(product_id)
)
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

DROP TABLE IF EXISTS providers;
CREATE TABLE providers (
    provider_id INT,
    name VARCHAR(50),
    email VARCHAR(100),
    webpage VARCHAR(100),
    PRIMARY KEY(provider_id)
)
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

COMMIT;
```

Load the dataset into the database:

In [20]:
# Connection settings for a local test server; adjust them to your environment
con = pymysql.connect(host='localhost', user='admin', password='Admin_2024', database='shop')
try:
    with con.cursor() as cur:
        for table, rows in dataset.items():
            if not rows:
                continue  # Nothing to insert for this table
            # All rows of a table share the same keys, so the statement is built once
            columns = list(rows[0].keys())
            str_columns = ",".join(columns)
            str_values = ",".join(["%s"] * len(columns))
            sql = f"INSERT INTO {table} ({str_columns}) VALUES ({str_values})"
            cur.executemany(sql, [tuple(row[column] for column in columns) for row in rows])
            con.commit()
finally:
    con.close()
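
The same loading pattern can be tried without a MySQL server. The sketch below uses an in-memory SQLite database from the standard library; the `load_dataset` helper, the `sample_dataset` dictionary and the simplified table definitions are assumptions made for illustration. Note that `sqlite3` uses `?` as its placeholder where PyMySQL uses `%s`.

```python
import sqlite3

# Tiny hypothetical dataset with the same shape as the notebook's: table -> list of rows
sample_dataset = {
    "providers": [
        {"provider_id": 1, "name": "Acme S.L.", "email": "info@example.org"},
        {"provider_id": 2, "name": "Globex S.A.", "email": "hola@example.com"},
    ],
    "categories": [],
}

def load_dataset(con, dataset):
    """Insert every table's rows using one parameterized statement per table."""
    for table, rows in dataset.items():
        if not rows:
            continue  # Nothing to insert
        columns = list(rows[0].keys())
        placeholders = ",".join("?" * len(columns))
        # Table and column names cannot be bound as parameters,
        # so they must come from trusted code, never from user input
        sql = f"INSERT INTO {table} ({','.join(columns)}) VALUES ({placeholders})"
        con.executemany(sql, [tuple(row[column] for column in columns) for row in rows])
    con.commit()

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE providers (provider_id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
con.execute("CREATE TABLE categories (category_id INTEGER PRIMARY KEY, name TEXT)")
load_dataset(con, sample_dataset)

print(con.execute("SELECT COUNT(*) FROM providers").fetchone()[0])  # 2
```

Values always travel as bound parameters, so quoting and escaping are left to the driver; only the identifiers are interpolated into the statement text.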

Script that adds the foreign key constraints to the database:

```
ALTER TABLE products ADD CONSTRAINT FK_product_category_id
FOREIGN KEY(category_id) REFERENCES categories(category_id)
ON DELETE SET NULL
ON UPDATE CASCADE;

ALTER TABLE products ADD CONSTRAINT FK_product_provider_id
FOREIGN KEY(provider_id) REFERENCES providers(provider_id)
ON DELETE SET NULL
ON UPDATE CASCADE;

ALTER TABLE addresses ADD CONSTRAINT FK_addresses_person_id
FOREIGN KEY(person_id) REFERENCES people(person_id)
ON DELETE CASCADE
ON UPDATE CASCADE;

ALTER TABLE payment_information ADD CONSTRAINT FK_payment_information_person_id
FOREIGN KEY(person_id) REFERENCES people(person_id)
ON DELETE CASCADE
ON UPDATE CASCADE;

ALTER TABLE order_product ADD CONSTRAINT FK_order_product_order_id
FOREIGN KEY(order_id) REFERENCES orders(order_id)
ON DELETE CASCADE
ON UPDATE CASCADE;

ALTER TABLE order_product ADD CONSTRAINT FK_order_product_product_id
FOREIGN KEY(product_id) REFERENCES products(product_id)
ON DELETE RESTRICT
ON UPDATE CASCADE;

ALTER TABLE orders ADD CONSTRAINT FK_orders_person_id
FOREIGN KEY(person_id) REFERENCES people(person_id)
ON DELETE RESTRICT
ON UPDATE CASCADE;

ALTER TABLE orders ADD CONSTRAINT FK_orders_billing_address_id
FOREIGN KEY(billing_address) REFERENCES addresses(address_id)
ON DELETE RESTRICT
ON UPDATE CASCADE;

ALTER TABLE orders ADD CONSTRAINT FK_orders_delivery_address_id
FOREIGN KEY(delivery_address) REFERENCES addresses(address_id)
ON DELETE RESTRICT
ON UPDATE CASCADE;

ALTER TABLE accesses ADD CONSTRAINT FK_accesses_person_id 
FOREIGN KEY(person_id) REFERENCES people(person_id)
ON DELETE SET NULL
ON UPDATE CASCADE;

ALTER TABLE accesses ADD CONSTRAINT FK_accesses_page_id
FOREIGN KEY(page_id) REFERENCES web_pages(page_id)
ON DELETE CASCADE
ON UPDATE CASCADE;

ALTER TABLE carts ADD CONSTRAINT FK_carts_person_id 
FOREIGN KEY(person_id) REFERENCES people(person_id)
ON DELETE CASCADE
ON UPDATE CASCADE; 

ALTER TABLE cart_product ADD CONSTRAINT FK_cart_product_cart_id
FOREIGN KEY(cart_id) REFERENCES carts(cart_id)
ON DELETE CASCADE
ON UPDATE CASCADE;

ALTER TABLE cart_product ADD CONSTRAINT FK_cart_product_product_id
FOREIGN KEY(product_id) REFERENCES products(product_id)
ON DELETE RESTRICT
ON UPDATE CASCADE;

ALTER TABLE invoices ADD CONSTRAINT FK_invoices_order_id
FOREIGN KEY(order_id) REFERENCES orders(order_id)
ON DELETE RESTRICT
ON UPDATE RESTRICT;
```
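
The referential actions used above (CASCADE, SET NULL and RESTRICT) can be observed with a small sketch on an in-memory SQLite database. This is an illustration under assumptions, not the MySQL setup itself: the tables are reduced to their key columns, SQLite declares foreign keys inside CREATE TABLE rather than through ALTER TABLE, and it only enforces them after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
con.executescript("""
CREATE TABLE people (person_id INTEGER PRIMARY KEY);
CREATE TABLE addresses (
    address_id INTEGER PRIMARY KEY,
    person_id INTEGER REFERENCES people(person_id) ON DELETE CASCADE
);
CREATE TABLE accesses (
    access_id INTEGER PRIMARY KEY,
    person_id INTEGER REFERENCES people(person_id) ON DELETE SET NULL
);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    person_id INTEGER REFERENCES people(person_id) ON DELETE RESTRICT
);
INSERT INTO people VALUES (1), (2);
INSERT INTO addresses VALUES (10, 1);
INSERT INTO accesses VALUES (100, 1);
INSERT INTO orders VALUES (1000, 2);
""")

# Person 1 has no orders: the address is deleted, the access survives with a NULL person
con.execute("DELETE FROM people WHERE person_id = 1")
remaining_addresses = con.execute("SELECT COUNT(*) FROM addresses").fetchone()[0]
access_person = con.execute("SELECT person_id FROM accesses").fetchone()[0]
print(remaining_addresses, access_person)  # 0 None

# Person 2 has an order, so RESTRICT rejects the delete
try:
    con.execute("DELETE FROM people WHERE person_id = 2")
    blocked = False
except sqlite3.IntegrityError:
    blocked = True
print(blocked)  # True
```

In terms of the schema above, deleting a person removes their addresses, carts and payment information, keeps their accesses as anonymous rows, and is refused while the person still has orders.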