Moving data between MySQL and HDFS
===

* Last modified: May 19, 2022

Downloading the data
---

In [1]:
filenames = [
    "drivers.csv",
    "timesheet.csv",
    "truck_event_text_partition.csv",
]

url = "https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/"

for filename in filenames:
    !wget --quiet {url + filename} -P /tmp/
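
Outside a notebook, where the `!wget` shell escape is not available, the same download can be sketched with the standard library. The helper names `build_urls` and `download_all` are illustrative assumptions, not part of the original notebook.

```python
import os
import urllib.request

BASE_URL = "https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/"
FILENAMES = ["drivers.csv", "timesheet.csv", "truck_event_text_partition.csv"]


def build_urls(base_url, filenames):
    """Return (url, filename) pairs; base_url is expected to end with '/'."""
    return [(base_url + name, name) for name in filenames]


def download_all(base_url, filenames, dest_dir="/tmp"):
    """Download every file into dest_dir and return the local paths."""
    paths = []
    for url, name in build_urls(base_url, filenames):
        path = os.path.join(dest_dir, name)
        urllib.request.urlretrieve(url, path)  # requires network access
        paths.append(path)
    return paths


# Uncomment to fetch the files:
# download_all(BASE_URL, FILENAMES)
```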

Running the MySQL container
--

```
docker network create mysql-network
```

```
docker run --name mysql-instance \
    -e MYSQL_ROOT_PASSWORD=secret \
    --network mysql-network \
    -p 3306:3306 \
    -d mysql:5.7
```
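
The MySQL server inside the container usually needs a few seconds to initialize after `docker run` returns, so commands issued immediately afterwards may fail to connect. A minimal sketch of a readiness check follows. `wait_for_port` is a hypothetical helper, and an open TCP port only suggests, rather than guarantees, that the server already accepts logins.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Poll host:port until a TCP connection succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example, assuming port 3306 is published on localhost as in the command above:
# wait_for_port("127.0.0.1", 3306)
```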

Creating the database and granting permissions
---

```
docker exec mysql-instance mysql -uroot -psecret -e "CREATE DATABASE demo_db; "
docker exec mysql-instance mysql -uroot -psecret -e "CREATE USER 'sqoop'@'%' IDENTIFIED BY 'secret'; "
docker exec mysql-instance mysql -uroot -psecret -e "GRANT ALL ON demo_db.* TO 'sqoop'@'%';"
```

Running the Sqoop container
---

```
docker run --rm -it \
    --network mysql-network \
    -v "$PWD":/workspace \
    -p 50070:50070 \
    -p 8088:8088 \
    -p 8888:8888 \
    jdvelasq/sqoop:1.4.7
```

Creating the tables and loading data into MySQL
---

In [2]:
import mysql.connector
import pandas as pd

conn = mysql.connector.connect(
    host="mysql-instance",
    user='sqoop',
    passwd='secret',
    db='demo_db',
)
cur = conn.cursor()

cur.execute(
    """
    DROP TABLE IF EXISTS drivers;
    """
)

cur.execute(
    """
    CREATE TABLE drivers (
        driverId       INT,
        name           VARCHAR(20),
        ssn            VARCHAR(20),
        location       VARCHAR(40),
        certified      VARCHAR(20),
        wage_plan      VARCHAR(20)
    );
    """
)

cur.execute(
    """
    DROP TABLE IF EXISTS timesheet;
    """
)

cur.execute(
    """
    CREATE TABLE timesheet (
        driverId       INT,
        week           INT,
        hours_logged   INT,
        miles_logged   INT
    );
    """
)

conn.commit()


drivers = pd.read_csv('/tmp/drivers.csv')

# Insert each CSV row with a parameterized statement, then commit once.
sql = "INSERT INTO drivers VALUES (%s,%s,%s,%s,%s,%s)"
for _, row in drivers.iterrows():
    cur.execute(sql, tuple(row))
conn.commit()

cur.execute("SELECT * FROM drivers LIMIT 5;")
result = cur.fetchall()
conn.close()
result

[(10, 'George Vetticaden', '621011971', '244-4532 Nulla Rd.', 'N', 'miles'),
 (11, 'Jamie Engesser', '262112338', '366-4125 Ac Street', 'N', 'miles'),
 (12, 'Paul Coddin', '198041975', 'Ap #622-957 Risus. Street', 'Y', 'hours'),
 (13, 'Joe Niemiec', '139907145', '2071 Hendrerit. Ave', 'Y', 'hours'),
 (14, 'Adis Cesir', '820812209', 'Ap #810-1228 In St.', 'Y', 'hours')]
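
Inserting one row at a time is fine for a small file such as `drivers.csv`. For larger files, the `mysql.connector` cursor also offers `executemany`, which takes a list of parameter tuples. The sketch below prepares such a list with the standard `csv` module. `read_driver_rows` is an illustrative name, and the code assumes that the first line is a header and that the first column is the integer `driverId`.

```python
import csv


def read_driver_rows(path):
    """Read a drivers-style CSV, skip the header, and return a list of tuples."""
    rows = []
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header line
        for record in reader:
            if not record:
                continue  # ignore blank lines
            rows.append((int(record[0]), *record[1:]))
    return rows


# Possible usage with an open connection and cursor (not executed here):
# rows = read_driver_rows("/tmp/drivers.csv")
# cur.executemany("INSERT INTO drivers VALUES (%s,%s,%s,%s,%s,%s)", rows)
# conn.commit()
```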

Listing the databases in MySQL
---

In [3]:
%%writefile list-databases.sh
sqoop list-databases \
    --connect jdbc:mysql://mysql-instance:3306/demo_db \
    --username sqoop \
    --password secret

Writing list-databases.sh


In [4]:
#
# The messages below are warnings only (unset *_HOME variables and the
# MySQL SSL notice); the database names appear at the end of the output.
#
!bash list-databases.sh

Please set $HBASE_HOME to the root of your HBase installation.
Please set $HCAT_HOME to the root of your HCatalog installation.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
Fri May 20 15:49:52 UTC 2022 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
information_schema
demo_db
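
The SSL warning in the output suggests setting `useSSL=false` explicitly when server certificate verification is not needed, as in this local demo. A sketch of building such a connection string follows. `jdbc_url` is a hypothetical helper, and disabling SSL is only reasonable on a trusted private network.

```python
from urllib.parse import urlencode


def jdbc_url(host, port, database, **params):
    """Build a MySQL JDBC URL, appending optional connection properties."""
    url = f"jdbc:mysql://{host}:{port}/{database}"
    if params:
        url += "?" + urlencode(params)
    return url


print(jdbc_url("mysql-instance", 3306, "demo_db", useSSL="false"))
# jdbc:mysql://mysql-instance:3306/demo_db?useSSL=false
```

The resulting string would be passed to `--connect`. It should be quoted in the shell script if it contains more than one property, because `&` is a shell metacharacter.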


Listing the tables in the MySQL database
--

In [5]:
%%writefile list-tables.sh
sqoop list-tables \
    --connect jdbc:mysql://mysql-instance:3306/demo_db \
    --username sqoop \
    --password secret

Writing list-tables.sh


In [6]:
!bash list-tables.sh

Please set $HBASE_HOME to the root of your HBase installation.
Please set $HCAT_HOME to the root of your HCatalog installation.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
Fri May 20 15:49:53 UTC 2022 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
drivers
timesheet


Checking the records in MySQL with a query
--

In [7]:
%%writefile query.sh
sqoop eval \
    --connect jdbc:mysql://mysql-instance:3306/demo_db \
    --username sqoop \
    --password secret \
    --query "SELECT * FROM drivers LIMIT 3"

Writing query.sh


In [8]:
!bash query.sh

Please set $HBASE_HOME to the root of your HBase installation.
Please set $HCAT_HOME to the root of your HCatalog installation.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
Fri May 20 15:49:54 UTC 2022 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
----------------------------------------------------------------------------------------------------------------------------------
| driverId    | name                 | ssn                  | location             | 

Importing a complete table into HDFS
--

In [9]:
%%writefile /tmp/full_import.sh

hdfs dfs -rm -r /tmp/drivers

sqoop import \
    --connect jdbc:mysql://mysql-instance:3306/demo_db \
    --username sqoop \
    --password secret \
    --table drivers \
    --target-dir /tmp/drivers \
    -m 1

Overwriting /tmp/full_import.sh


In [10]:
!bash /tmp/full_import.sh

Deleted /tmp/drivers
Please set $HBASE_HOME to the root of your HBase installation.
Please set $HCAT_HOME to the root of your HCatalog installation.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
Fri May 20 15:49:57 UTC 2022 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
Note: /tmp/sqoop-root/compile/7c6ccfc14343ffffba7485cfd3db64d8/drivers.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Fri May 20 15:50:03 UTC 2022 W

In [11]:
!hdfs dfs -ls /tmp/drivers

Found 2 items
-rw-r--r--   1 root supergroup          0 2022-05-20 15:50 /tmp/drivers/_SUCCESS
-rw-r--r--   1 root supergroup       1963 2022-05-20 15:50 /tmp/drivers/part-m-00000


In [12]:
!hdfs dfs -cat /tmp/drivers/part-m-00000

10,George Vetticaden,621011971,244-4532 Nulla Rd.,N,miles
11,Jamie Engesser,262112338,366-4125 Ac Street,N,miles
12,Paul Coddin,198041975,Ap #622-957 Risus. Street,Y,hours
13,Joe Niemiec,139907145,2071 Hendrerit. Ave,Y,hours
14,Adis Cesir,820812209,Ap #810-1228 In St.,Y,hours
15,Rohit Bakshi,239005227,648-5681 Dui- Rd.,Y,hours
16,Tom McCuch,363303105,P.O. Box 313- 962 Parturient Rd.,Y,hours
17,Eric Mizell,123808238,P.O. Box 579- 2191 Gravida. Street,Y,hours
18,Grant Liu,171010151,Ap #928-3159 Vestibulum Av.,Y,hours
19,Ajay Singh,160005158,592-9430 Nonummy Avenue,Y,hours
20,Chris Harris,921812303,883-2691 Proin Avenue,Y,hours
21,Jeff Markham,209408086,Ap #852-7966 Facilisis St.,Y,hours
22,Nadeem Asghar,783204269,154-9147 Aliquam Ave,Y,hours
23,Adam Diaz,928312208,P.O. Box 260- 6127 Vitae Road,Y,hours
24,Don Hilborn,254412152,4361 Ac Road,Y,hours
25,Jean-Philippe Playe,913310051,P.O. Box 812- 6238 Ac Rd.,Y,hours
26,Michael Aube,124705141,P.O. Box 213- 8948 Nec Ave,Y,hours
27,Mark Lochbih
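
Sqoop writes text imports as comma-delimited lines by default, one record per line, which matches the output above. A sketch of turning those lines back into records in Python follows. `parse_drivers` and the `Driver` tuple are illustrative names, and the parsing assumes that no field contains an embedded comma, as is the case in the rows shown.

```python
import csv
from collections import namedtuple

Driver = namedtuple(
    "Driver", ["driverId", "name", "ssn", "location", "certified", "wage_plan"]
)


def parse_drivers(lines):
    """Convert comma-delimited import lines into Driver records."""
    records = []
    for fields in csv.reader(lines):
        if len(fields) != len(Driver._fields):
            continue  # skip blank or truncated lines
        records.append(Driver(int(fields[0]), *fields[1:]))
    return records


sample = [
    "10,George Vetticaden,621011971,244-4532 Nulla Rd.,N,miles",
    "12,Paul Coddin,198041975,Ap #622-957 Risus. Street,Y,hours",
]
for driver in parse_drivers(sample):
    print(driver.driverId, driver.name, driver.wage_plan)
# 10 George Vetticaden miles
# 12 Paul Coddin hours
```

In a notebook, the input lines could be captured with IPython's assignment form, for example `lines = !hdfs dfs -cat /tmp/drivers/part-m-00000`.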

Importing a subset of a table into HDFS
--

In [13]:
%%writefile partial-import.sh

hdfs dfs -rm -r /tmp/drivers

sqoop import \
    --connect jdbc:mysql://mysql-instance:3306/demo_db \
    --username sqoop \
    --password secret \
    --table drivers \
    --target-dir /tmp/drivers/ \
    -m 1 \
    --where "driverId=10"

Writing partial-import.sh


In [14]:
!bash partial-import.sh

Deleted /tmp/drivers
Please set $HBASE_HOME to the root of your HBase installation.
Please set $HCAT_HOME to the root of your HCatalog installation.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
Fri May 20 15:50:20 UTC 2022 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
Note: /tmp/sqoop-root/compile/db3e5e0d16414871f43b3ea4a8726bbb/drivers.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Fri May 20 15:50:24 UTC 2022 W

In [15]:
!hdfs dfs -ls /tmp/drivers/

Found 2 items
-rw-r--r--   1 root supergroup          0 2022-05-20 15:50 /tmp/drivers/_SUCCESS
-rw-r--r--   1 root supergroup         58 2022-05-20 15:50 /tmp/drivers/part-m-00000


In [16]:
!hdfs dfs -cat /tmp/drivers/part-m-00000

10,George Vetticaden,621011971,244-4532 Nulla Rd.,N,miles
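
The full and partial imports differ mainly in the `--where` clause, so both commands could be generated from one helper. The sketch below builds the argument list for `subprocess.run`. `sqoop_import_args` is a hypothetical name, and only flags already used in this notebook appear in it.

```python
import shlex


def sqoop_import_args(table, target_dir, where=None, mappers=1):
    """Return the sqoop import command as a list of arguments."""
    args = [
        "sqoop", "import",
        "--connect", "jdbc:mysql://mysql-instance:3306/demo_db",
        "--username", "sqoop",
        "--password", "secret",
        "--table", table,
        "--target-dir", target_dir,
        "-m", str(mappers),
    ]
    if where:
        args += ["--where", where]
    return args


cmd = sqoop_import_args("drivers", "/tmp/drivers", where="driverId=10")
print(shlex.join(cmd))  # shell-quoted form, handy for logging
# To execute inside the Sqoop container: subprocess.run(cmd, check=True)
```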


Moving `timesheet.csv` to HDFS
--

In [17]:
!tail -n +2 /tmp/timesheet.csv > /tmp/timesheet1.csv
!head /tmp/timesheet1.csv

10,1,70,3300
10,2,70,3300
10,3,60,2800
10,4,70,3100
10,5,70,3200
10,6,70,3300
10,7,70,3000
10,8,70,3300
10,9,70,3200
10,10,50,2500
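
The header line is dropped because `sqoop export` parses every line of the input as a record, and column names cannot be converted to `INT`. The same step is sketched in Python below, for environments without `tail`. `strip_header` is an illustrative helper.

```python
def strip_header(src, dst):
    """Copy src to dst without its first line; return the number of data lines."""
    count = 0
    with open(src) as fin, open(dst, "w") as fout:
        next(fin, None)  # discard the header
        for line in fin:
            fout.write(line)
            count += 1
    return count


# Possible usage with the paths from the cell above:
# strip_header("/tmp/timesheet.csv", "/tmp/timesheet1.csv")
```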


In [18]:
!hdfs dfs -rm /tmp/timesheet.csv
!hdfs dfs -copyFromLocal /tmp/timesheet1.csv /tmp/timesheet.csv
!hdfs dfs -ls /tmp/

Deleted /tmp/timesheet.csv
Found 3 items
drwxr-xr-x   - root supergroup          0 2022-05-20 15:50 /tmp/drivers
drwx------   - root supergroup          0 2022-05-20 15:05 /tmp/hadoop-yarn
-rw-r--r--   1 root supergroup      26164 2022-05-20 15:50 /tmp/timesheet.csv


Exporting data from HDFS to MySQL
--

In [19]:
%%writefile /tmp/export.sh

sqoop export \
    --connect jdbc:mysql://mysql-instance:3306/demo_db \
    --username sqoop \
    --password secret \
    --table timesheet \
    --export-dir /tmp/timesheet.csv

Overwriting /tmp/export.sh
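
Because the `timesheet` table has four `INT` columns, a malformed line in the export file is likely to make the export job fail partway through. A sketch of a pre-flight check on the local copy of the file follows. `find_bad_lines` is a hypothetical helper that assumes comma-delimited integer fields, and the sample lines are made up for illustration.

```python
def is_int(text):
    """Return True when text can be converted to an integer."""
    try:
        int(text)
        return True
    except ValueError:
        return False


def find_bad_lines(lines, n_fields=4):
    """Return (line_number, text) for lines that are not n_fields integers."""
    bad = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        fields = text.split(",")
        if len(fields) != n_fields or not all(is_int(f) for f in fields):
            bad.append((number, text))
    return bad


# Illustrative lines, not taken from the real file.
sample = ["10,1,70,3300\n", "driverId,week,hours_logged,miles_logged\n", "10,3,60\n"]
print(find_bad_lines(sample))
# [(2, 'driverId,week,hours_logged,miles_logged'), (3, '10,3,60')]
```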


In [20]:
!bash /tmp/export.sh

Please set $HBASE_HOME to the root of your HBase installation.
Please set $HCAT_HOME to the root of your HCatalog installation.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
Fri May 20 15:50:45 UTC 2022 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
Note: /tmp/sqoop-root/compile/c4bbeeef55f6c8e56619ef8950c5b89c/timesheet.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.


Verification
---

In [21]:
conn = mysql.connector.connect(
    host="mysql-instance",
    user='sqoop',
    passwd='secret',
    db='demo_db',
)
cur = conn.cursor()

cur.execute(
    """
    SELECT * FROM timesheet LIMIT 5;
    """
)
result = cur.fetchall()
conn.close()
result

[(18, 30, 54, 2738),
 (18, 31, 55, 2510),
 (18, 32, 60, 2695),
 (18, 33, 54, 2564),
 (18, 34, 53, 2778)]

In [22]:
!rm *.java *.sh