Sqoop import (migration from the database to HDFS)
===

* Last modified: May 19, 2022

Downloading the data
---

In [1]:
file_name = "drivers.csv"
url = "https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/drivers/"
!wget --quiet {url + file_name} -P /tmp/

Creating and loading the database
---

In [2]:
import sqlite3

conn = sqlite3.connect("sample.db")
cur = conn.cursor()

conn.executescript(
    """
    DROP TABLE IF EXISTS drivers;

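    -- TEXT is used instead of STRING: SQLite gives STRING columns
    -- NUMERIC affinity, which would store values such as ssn as numbers.
    CREATE TABLE drivers (
        driverId       INT,
        name           TEXT,
        ssn            TEXT,
        location       TEXT,
        certified      TEXT,
        wage_plan      TEXT
    );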
    """
)
conn.commit()

import csv

# Parse with the csv module so quoted fields that contain commas stay intact.
with open("/tmp/drivers.csv", "rt", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    data = [tuple(row) for row in reader if row]

cur.executemany("INSERT INTO drivers VALUES (?,?,?,?,?,?)", data)
conn.commit()
conn.close()

In [3]:
!sqlite3 sample.db "select * from drivers" | head

10|George Vetticaden|621011971|244-4532 Nulla Rd.|N|miles
11|Jamie Engesser|262112338|366-4125 Ac Street|N|miles
12|Paul Coddin|198041975|Ap #622-957 Risus. Street|Y|hours
13|Joe Niemiec|139907145|2071 Hendrerit. Ave|Y|hours
14|Adis Cesir|820812209|Ap #810-1228 In St.|Y|hours
15|Rohit Bakshi|239005227|648-5681 Dui- Rd.|Y|hours
16|Tom McCuch|363303105|P.O. Box 313- 962 Parturient Rd.|Y|hours
17|Eric Mizell|123808238|P.O. Box 579- 2191 Gravida. Street|Y|hours
18|Grant Liu|171010151|Ap #928-3159 Vestibulum Av.|Y|hours
19|Ajay Singh|160005158|592-9430 Nonummy Avenue|Y|hours
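
A self-contained way to try the load step without touching `sample.db` is sketched below. It is only an illustration under stated assumptions: the database is in memory, the three rows are copied from the output above, the header line is assumed to match the table's column names, the columns are declared as `TEXT`, and the `load_drivers` helper is a hypothetical name that does not appear in the notebook.

```python
import csv
import io
import sqlite3

# Three rows copied from the query output above. The header line is an
# assumption: it simply mirrors the column names of the drivers table.
SAMPLE_CSV = """driverId,name,ssn,location,certified,wage_plan
10,George Vetticaden,621011971,244-4532 Nulla Rd.,N,miles
11,Jamie Engesser,262112338,366-4125 Ac Street,N,miles
12,Paul Coddin,198041975,Ap #622-957 Risus. Street,Y,hours
"""


def load_drivers(conn, csv_text):
    """Hypothetical helper: (re)create the drivers table, load it from CSV
    text, and return the number of rows stored."""
    conn.executescript(
        """
        DROP TABLE IF EXISTS drivers;
        CREATE TABLE drivers (
            driverId  INT,
            name      TEXT,
            ssn       TEXT,
            location  TEXT,
            certified TEXT,
            wage_plan TEXT
        );
        """
    )
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    rows = [tuple(row) for row in reader if row]
    conn.executemany("INSERT INTO drivers VALUES (?,?,?,?,?,?)", rows)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM drivers").fetchone()[0]


conn = sqlite3.connect(":memory:")
n_rows = load_drivers(conn, SAMPLE_CSV)
first = conn.execute(
    "SELECT driverId, name FROM drivers ORDER BY driverId LIMIT 1"
).fetchone()
conn.close()
print(n_rows, first)
```

Comparing the returned row count with the number of data lines in the CSV is a cheap sanity check to run before pointing Sqoop at the database.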


Importing a full table
---

In [4]:
!sqoop eval --connect jdbc:sqlite:sample.db --driver org.sqlite.JDBC --query "SELECT * FROM drivers LIMIT 3"

Please set $HBASE_HOME to the root of your HBase installation.
Please set $HCAT_HOME to the root of your HCatalog installation.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
-------------------------------------------------------------------------------------------------------------------------------------------
| driverId             | name                 | ssn                  | location             | certified            | wage_plan            | 
-------------------------------------------------------------------------------------------------------------------------------------------
| 10                   | George Vetticaden    | 621011971            | 244-4532 Nulla Rd.   | N                    | miles                | 
| 11                   | Jamie Engesser       | 262112338            | 366-4125 Ac Street   | N                    | miles                | 
| 12                   | Paul

In [5]:
%%writefile /tmp/full_import.sh

hdfs dfs -rm -r /tmp/drivers

sqoop-import \
    --connect jdbc:sqlite:sample.db \
    --driver org.sqlite.JDBC \
    --table drivers \
    --target-dir /tmp/drivers \
    -m 1

Overwriting /tmp/full_import.sh


In [6]:
!bash /tmp/full_import.sh

Deleted /tmp/drivers
Please set $HBASE_HOME to the root of your HBase installation.
Please set $HCAT_HOME to the root of your HCatalog installation.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
Note: /tmp/sqoop-root/compile/4dd202e347a783dab04b175af2a542fe/drivers.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Error: java.io.IOException: SQLException in nextKeyValue
	at org.apache.sqoop.mapreduce.db.DBRecordReader.nextKeyValue(DBRecordReader.java:277)
	at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:562)
	at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
	at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
	at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgr

In [7]:
!sqoop import --connect jdbc:sqlite:sample.db --driver org.sqlite.JDBC  --query "SELECT * FROM drivers WHERE driverId=10 AND \$CONDITIONS" --target-dir /tmp/drivers -m 1

Please set $HBASE_HOME to the root of your HBase installation.
Please set $HCAT_HOME to the root of your HCatalog installation.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
Note: /tmp/sqoop-root/compile/990a8af57056ffce0d7e285c173fb8f9/QueryResult.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
22/05/19 22:28:37 ERROR tool.ImportTool: Import failed: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://0.0.0.0:9000/tmp/drivers already exists
	at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:146)
	at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:279)
	at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:145)
	at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1570)
	at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1567)
	at j
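
The `FileAlreadyExistsException` in the log above is reported because the output directory `/tmp/drivers` already exists when the job is submitted. One way to avoid it is Sqoop's `--delete-target-dir` option, which removes the target directory before the import runs. The sketch below only assembles the command line and does not execute Sqoop; the helper `build_query_import` and its parameter names are hypothetical and chosen for illustration.

```python
import shlex


def build_query_import(connect, driver, query, target_dir, mappers=1, overwrite=True):
    """Hypothetical helper: return the argument list for a free-form
    `sqoop import --query` call."""
    # Sqoop requires the literal $CONDITIONS token in free-form queries.
    if "$CONDITIONS" not in query:
        raise ValueError("the query must contain the $CONDITIONS placeholder")
    args = [
        "sqoop", "import",
        "--connect", connect,
        "--driver", driver,
        "--query", query,
        "--target-dir", target_dir,
        "-m", str(mappers),
    ]
    if overwrite:
        # Remove an existing target directory instead of failing on it.
        args.append("--delete-target-dir")
    return args


cmd = build_query_import(
    connect="jdbc:sqlite:sample.db",
    driver="org.sqlite.JDBC",
    query="SELECT * FROM drivers WHERE driverId=10 AND $CONDITIONS",
    target_dir="/tmp/drivers",
)
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Building the arguments as a list keeps `$CONDITIONS` out of reach of shell expansion, so the backslash escape used in the notebook cell is only needed when the command is typed directly in a shell.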

In [8]:
!hdfs dfs -ls /user/root/drivers

ls: `/user/root/drivers': No such file or directory


In [9]:
!hdfs dfs -cat /user/root/truck_events/part-m-00000

cat: `/user/root/truck_events/part-m-00000': No such file or directory


Importing a subset of a table's rows
---

In [10]:
%%writefile partial-import.sh

hdfs dfs -rm -r /tmp/drivers

sqoop import \
    --connect jdbc:sqlite:sample.db \
    --table drivers \
    --driver org.sqlite.JDBC \
    -m 1 \
    --where "driverId=10" \
    --target-dir /tmp/drivers/

Overwriting partial-import.sh


In [11]:
!bash partial-import.sh

Deleted /tmp/drivers
Please set $HBASE_HOME to the root of your HBase installation.
Please set $HCAT_HOME to the root of your HCatalog installation.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
Note: /tmp/sqoop-root/compile/3db8756b79b6cf4b930c3a7a02c2c0a3/drivers.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Error: java.io.IOException: SQLException in nextKeyValue
	at org.apache.sqoop.mapreduce.db.DBRecordReader.nextKeyValue(DBRecordReader.java:277)
	at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:562)
	at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
	at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
	at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgr

In [12]:
!hdfs dfs -ls /tmp/drivers/

In [13]:
!hdfs dfs -cat /tmp/drivers/part-m-00000

cat: `/tmp/drivers/part-m-00000': No such file or directory
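
Before launching the import, the `--where` predicate can be previewed directly against SQLite to see which rows it selects. This is a minimal sketch, assuming a reduced three-column version of the table with rows taken from the earlier output and a hypothetical `preview_where` helper. It illustrates the filter only, not the SQL that Sqoop generates.

```python
import sqlite3


def preview_where(conn, table, where):
    """Hypothetical helper: return the rows of `table` matching `where`."""
    # The table name and predicate are interpolated as-is, so pass only
    # trusted strings (here, the same text given to --table and --where).
    return conn.execute(f"SELECT * FROM {table} WHERE {where}").fetchall()


conn = sqlite3.connect(":memory:")
# Reduced schema (an assumption of this sketch) with three sample rows.
conn.execute("CREATE TABLE drivers (driverId INT, name TEXT, wage_plan TEXT)")
conn.executemany(
    "INSERT INTO drivers VALUES (?,?,?)",
    [
        (10, "George Vetticaden", "miles"),
        (11, "Jamie Engesser", "miles"),
        (12, "Paul Coddin", "hours"),
    ],
)
matching = preview_where(conn, "drivers", "driverId=10")
conn.close()
print(matching)
```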


In [14]:
!rm sample.db *.java *.sh