# Data Loading

This notebook documents the functions and other code used to load datasets into Jupyter or Colab from different sources (AWS S3, Google Drive, or a local folder), in different formats (CSV, HDF5, Feather, pickle), and with different techniques (batches, Dask, etc.).

Loading is the first stage; every subsequent stage of working with the data depends on it.


In [1]:
# First import the required libraries.
import pandas as pd
import numpy as np

In [2]:
# To load a file into colab:
import io
import os

In [3]:
# To load a file from AWS-S3
import boto3
import logging
from botocore.exceptions import ClientError

# Step 1: Find the file

In [None]:
# Get the current directory
print(os.getcwd())
# Check if the directory exists
print(os.path.exists('../Inferencia y recomendacion - EGM'))

/home/jovyan/work/PROJECT/Data
True


In [138]:
cd ../Inferencia y recomendacion - EGM

/home/jovyan/work/PROJECT/Inferencia y recomendacion - EGM


In [139]:
ls

 [0m[01;32m8aca180564c2-Anexo_1___Segmentación_de_clientes_e_Inferencias_de_información__1_.pdf[0m[K*
[01;32m'Acuerdo - Correlation One y EGM.pdf'[0m*
[01;32m'Documento t'$'\302\202''cnico - MinTIC - RetoInferenciaRecomendacion.docx'[0m*
 [01;32mplacetopayDB1.csv[0m*
 [01;32mplacetopayDB2.csv[0m*
[01;32m'Sebastian Londono - Inferencia y recomendacion - EGM.zip'[0m*


In [140]:
# These are the addresses where the file can be found. You have to update this info:
FileName = 'placetopayDB2.csv'
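
Before loading, it can help to confirm that the file is where the notebook expects it and to check its size, since the size guides the choice between a full load and the batch techniques used later. The following is a minimal sketch; `describe_file` is a hypothetical helper name, not part of the original notebook.

```python
from pathlib import Path


def describe_file(path):
    """Return (exists, size_in_mb) for a path; size is None if the file is missing."""
    p = Path(path)
    if not p.is_file():
        return False, None
    return True, p.stat().st_size / 1024 ** 2


# Hypothetical usage with the file name used in this notebook
exists, size_mb = describe_file('placetopayDB2.csv')
print(exists, size_mb)
```

If the file turns out to be larger than the available RAM, the sampling and chunking cells further down are the safer starting point.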

## Upload Files to Google Drive

In [None]:
# To Upload files to Google Colab
from google.colab import files
uploaded = files.upload()

In [None]:
# Address of the file in Google Drive (This is only to show the file in drive, for reference):
# DB1
#https://drive.google.com/file/d/1IJWjFNE2Fz4o0TCdRmB-UijFti7m0MLH/view?usp=sharing

# DB2
URL = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'

# Actual address to use in order to load the file into Colab:
#FilePath = '/content/drive/My Drive/DS4A-3/Place to pay - DS4A - Databases and Notebooks/placetopayDB2.csv'

#Merchants:
FilePath = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'

# Step 2: Loading
## Load CSV into colab from Google Drive:

In [None]:
# For Using Google Drive, (Only if executing notebook from Google colab):
from google.colab import drive
drive.mount('/content/drive')

# after that:
# <--- Refresh mounted Drive
# <--- Look for file and get the path link

Mounted at /content/gdrive


In [None]:
# Load full database into Colab:
bd = pd.read_csv( FilePath, encoding = 'utf-8')
# Dataset is now stored in a Pandas Dataframe

## Load CSV into jupyter from local folder:

In [141]:
# To load full file. 
from csv import reader

# if it fails it might be necessary to add encoding = "utf-8"
opened_file = open('placetopayDB2.csv')   # File path 
read_file = reader(opened_file,delimiter=',')
read_file

<_csv.reader at 0x7f0ce8dc4cf8>

In [142]:
# just to check schema is as expected: 
Sample = pd.read_csv(FileName, nrows = 10) 
# Only the first 10 rows of the dataset are now stored in a Pandas DataFrame
# You may need to pass sep=',', encoding='utf-8' / 'latin1', or on_bad_lines='skip' (error_bad_lines=False in pandas < 1.3)
print('Dataset is now stored in a Pandas Dataframe')
Sample.head()

Dataset is now stored in a Pandas Dataframe


Unnamed: 0,transaction_user_agent,transaction_id,transaction_description,transaction_processing_date_,transaction_processing_hour,transaction_request_language,transaction_payer_id,transaction_payer_document_type,transaction_payer_email,IP,...,reason_code_iso,reason_description,reason_clasiffication,paymentmethod_name,paymentmethod_franchise,paymentmethod_type,isic_division_id,isic_division_name,isic_section_id,isic_section_name
0,Mozilla/5.0 (Linux; Android 8.0.0; SM-G930F) A...,COA1494204936,Pago con QR,16/09/2020,9,ES,"D.B>""GH7@2U';:]7L?Y""KAX2KQ=3>H$E&%W<=:[L+%4",CRCPF,"O`>CT'APF_+(_E&_`=&!#68:+%^=6R;L<Z4L!=>""_LO","L?B]A+]``$6);JIP""7<P]$.HD[*2>X^N#6UQ#J\E0GX]$....",...,00,Aprobada,Red/Banco,Transerver Mastercard,MASTERCARD,CREDITCARD,,,,
1,Mozilla/5.0 (iPhone; CPU iPhone OS 13_6_1 like...,COA1493496261,Pago por QR,02/09/2020,18,ES,"ET]'&;[G9N0U55`QCCX>""'-!7>,U#JUJ(IS/G6F92""5",CRCPF,"KS8!F9H!""(<P\(N6Q%$K%9HD=5^SGUSK)EQ453U3\C/","D\A""`H@+T#6AS%G6HT(M]$.HYGA::C6;#6EBZD(Q+`Q]$....",...,?2,Transaccion Declinada Por Politicas De Control...,Políticas de control de riesgos,Transerver Visa,VISA,CREDITCARD,,,,
2,Mozilla/5.0 (Linux; Android 8.0.0; SM-G930F) A...,COA1493494554,Pago por QR,02/09/2020,18,ES,"D.B>""GH7@2U';:]7L?Y""KAX2KQ=3>H$E&%W<=:[L+%4",CRCPF,"O`>CT'APF_+(_E&_`=&!#68:+%^=6R;L<Z4L!=>""_LO","L?B]A+]``$6);JIP""7<P]$.HD[*2>X^N#6UQ#J\E0GX]$....",...,00,Aprobada,Red/Banco,Transerver Mastercard,MASTERCARD,CREDITCARD,,,,
3,Mozilla/5.0 (Linux; Android 10; SAMSUNG SM-G97...,COA1492916120,4395,25/08/2020,13,ES,"FZ0#UZ`R')9'I8A^4$D$7_/\*U=@G(CGY[#55S`[,!@",CRCPF,"F_NL4'KE7W!AS+M[,M2FNKTSW$5DT)<$8X-X\RP^'W=",D3M%3I)LU#6Q.HC[C:6`]$.L!HM$@:CD#6I-:!AOV/L]$....,...,00,Aprobada,Red/Banco,Transerver Mastercard,MASTERCARD,CREDITCARD,,,,
4,Mozilla/5.0 (iPhone; CPU iPhone OS 13_6_1 like...,COA1493496172,Pago por QR,02/09/2020,18,ES,"IUAE(,>""%&@IPA->+UM58^?!+IM?E^6.T6Q,:OE&PR9",CRCPF,OB:6$X[5R&1Q24:K2;$<Z[*-'J#4D>8WW4I<](_1P%#,"L?B]A+]``$6);JIP""7<P]$.HD[*2>X^N#6UQ#J\E0GX]$....",...,?-,Transaccion Pendiente. Por Favor Consulte Con ...,Red/Banco,Transerver Visa,VISA,CREDITCARD,,,,
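
The comments above suggest trying `utf-8` or `latin1` when a read fails. One way to make that choice less of a guess is to test the candidate encodings before calling `pd.read_csv`. The sketch below uses only the standard library; `detect_encoding` is a hypothetical helper, and the candidate list is an assumption based on the two encodings this notebook mentions.

```python
def detect_encoding(path, candidates=('utf-8', 'latin1')):
    """Return the first candidate encoding that decodes the whole file without error."""
    for encoding in candidates:
        try:
            with open(path, encoding=encoding) as f:
                for _ in f:  # stream line by line to keep memory use low
                    pass
            return encoding
        except UnicodeDecodeError:
            continue  # this encoding cannot decode the file, try the next one
    raise ValueError(f'None of {candidates} could decode {path}')
```

The result could then be passed on, for example `pd.read_csv(FileName, encoding=detect_encoding(FileName), nrows=10)`. Note that `latin1` can decode any byte sequence, so it should stay last in the list, and a successful decode does not prove the accents will display correctly.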


In [None]:
# Load the database to start exploratory analysis:
bd = pd.read_csv( FileName, encoding = 'utf-8')
# Dataset is now stored in a Pandas Dataframe
bd.head()

## Sampling

In [None]:
# Load the first rows of the database to start exploratory analysis:
bd = pd.read_csv( FileName, encoding = 'utf-8', nrows = 100000) # Load a Sample of the first 100 000 rows
# Dataset is now stored in a Pandas Dataframe
bd.head()

In [None]:
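import random

# Count the lines in the file, including the header (binary mode avoids encoding errors)
with open(FileName, 'rb') as f:
    num_lines = sum(1 for _ in f)

# Sample size - in this case: 1,000,000 rows
Rows = 1000000

# The row indices to skip - make sure 0 is not included to keep the header!
# The file has num_lines - 1 data rows, so skip all but `Rows` of them.
skip_idx = random.sample(range(1, num_lines), max(num_lines - 1 - Rows, 0))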

In [None]:
%%time
# Read a sample of the data, in batches
DataChunk = pd.read_csv(FileName, skiprows=skip_idx, chunksize=100000, sep=',', encoding='utf-8') # latin1 did not work for accents
dfList = []
for chunk in DataChunk:
    dfList.append(pd.DataFrame(chunk))
    print(chunk.shape, type(chunk))
    del chunk
bd = pd.concat(dfList,sort=False)
del DataChunk

In [None]:
# ---------- To load files in batches (chunks) into Jupyter:
# Instead of reading the whole file in a single call, such as:
#     movies_tv = pd.read_csv("movies_crop.csv")
# read it in chunks and concatenate them:
iterator = pd.read_csv('movies_crop.csv', chunksize = 100000)
movies_tv = pd.concat(iterator)
del iterator

# If you are using the compressed JSON files, the solution is similar:
iterator = pd.read_json('reviews_Movies_and_TV_5.json.gz', lines = True, compression = 'gzip', chunksize = 100000)
movies_tv = pd.concat(iterator)
del iterator

# Both snippets load the entire dataset and then delete the iterator to free RAM.

# This is the size of the batch. The right number depends on the system you are working with and the shape of the file.
ChunkSize = 10 ** 5

# Reference: https://pythonspeed.com/articles/chunking-pandas/

# To avoid holding the whole dataset in memory, structure the program as:
#for chunk in pd.read_csv(FileName, chunksize=ChunkSize):
#    process(chunk)
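
The `for chunk in ...: process(chunk)` pattern pays off when each chunk is reduced to a small summary instead of being kept. The following is a minimal sketch of that idea; the in-memory CSV, its column names, and the `count_by_column` helper are made up for illustration and do not come from the project data.

```python
import io
from collections import Counter

import pandas as pd

# Tiny in-memory CSV standing in for the real file (made-up data)
csv_text = "franchise,amount\nVISA,10\nMASTERCARD,20\nVISA,30\nVISA,5\nMASTERCARD,1\n"


def count_by_column(source, column, chunksize):
    """Count the values of one column, reading the CSV `chunksize` rows at a time."""
    counts = Counter()
    for chunk in pd.read_csv(source, chunksize=chunksize):
        # Only the small per-chunk summary is kept; the chunk itself is discarded
        counts.update(chunk[column].value_counts().to_dict())
    return counts


totals = count_by_column(io.StringIO(csv_text), 'franchise', chunksize=2)
print(totals)
```

Because only the running totals stay in memory, peak memory use is bounded by the chunk size rather than by the size of the file.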

In [144]:
%%time
DataChunk = pd.read_csv(FileName, chunksize=100000, sep=',', encoding='latin1') #utf-8
dfList = []
for chunk in DataChunk:
    dfList.append(pd.DataFrame(chunk))
    print(chunk.shape, type(chunk))
    del chunk                         # Free the memory; otherwise the kernel may crash
bd = pd.concat(dfList,sort=False)     # You can also process each chunk inside the for loop instead of storing it.
del DataChunk                         # Free the memory; otherwise the kernel may crash

(100000, 47) <class 'pandas.core.frame.DataFrame'>
(100000, 47) <class 'pandas.core.frame.DataFrame'>
(100000, 47) <class 'pandas.core.frame.DataFrame'>
(100000, 47) <class 'pandas.core.frame.DataFrame'>
(100000, 47) <class 'pandas.core.frame.DataFrame'>
(100000, 47) <class 'pandas.core.frame.DataFrame'>
(100000, 47) <class 'pandas.core.frame.DataFrame'>
(100000, 47) <class 'pandas.core.frame.DataFrame'>
(100000, 47) <class 'pandas.core.frame.DataFrame'>
(100000, 47) <class 'pandas.core.frame.DataFrame'>
(100000, 47) <class 'pandas.core.frame.DataFrame'>
(100000, 47) <class 'pandas.core.frame.DataFrame'>
(100000, 47) <class 'pandas.core.frame.DataFrame'>
(100000, 47) <class 'pandas.core.frame.DataFrame'>
(100000, 47) <class 'pandas.core.frame.DataFrame'>
(100000, 47) <class 'pandas.core.frame.DataFrame'>
(100000, 47) <class 'pandas.core.frame.DataFrame'>
(100000, 47) <class 'pandas.core.frame.DataFrame'>
(100000, 47) <class 'pandas.core.frame.DataFrame'>
(100000, 47) <class 'pandas.cor




In [None]:
# DOES NOT WORK

bd = pd.concat(pd.read_csv(FileName, dtype='str', sep=',', encoding='latin1', chunksize=5000)) # error_bad_lines=False
# If the load succeeds, the dataset is stored in a Pandas DataFrame

# Warning raised: /usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py:2718: DtypeWarning: Columns (1,11,12,13) have mixed types.Specify dtype option on import or set low_memory=False.
#  interactivity=interactivity, compiler=compiler, result=result)
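
The `DtypeWarning` above recommends specifying the `dtype` option on import. A minimal sketch of what that can look like follows, using a made-up in-memory CSV. The column name `reason_code_iso` and the values `00` and `?2` are taken from the sample output earlier in this notebook; the rest of the data is invented for illustration.

```python
import io

import pandas as pd

# Made-up data: reason_code_iso mixes numeric-looking and text values
csv_text = "transaction_id,reason_code_iso\nA1,00\nA2,?2\nA3,05\n"

# Declaring the column as str keeps leading zeros and avoids
# type inference that can differ from one chunk to the next
df = pd.read_csv(io.StringIO(csv_text), dtype={'reason_code_iso': str})
print(df['reason_code_iso'].tolist())
```

For the real file, the same `dtype` dictionary could list the columns named in the warning, at the cost of having to convert any truly numeric columns afterwards.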

In [None]:
# DOES NOT WORK as batch loading: nrows only reads the first ChunkSize rows
# Load a sample into Colab:
bd = pd.read_csv( FilePath, encoding = 'utf-8', nrows = ChunkSize )  # nrows is a sample


## To Load CSV from AWS S3 bucket

In [None]:
# Access key to the AWS Server:
AWS_KEY_ID="XXXXXXXXXXXXXXX"
AWS_SECRET="XXXXXXXXXXXXXXX"
bucket = "XXXXXXXXXXXXXXX"
key= "XXXXXXXXXXXXXXX"
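
Hard-coding access keys in a notebook makes them easy to leak when the notebook is shared. One common alternative is to read them from environment variables. The sketch below assumes the variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are set; `aws_credentials_from_env` is a hypothetical helper, not part of the original notebook.

```python
import os


def aws_credentials_from_env(env=os.environ):
    """Build the credential keyword arguments for boto3.client from environment variables."""
    try:
        return {
            'aws_access_key_id': env['AWS_ACCESS_KEY_ID'],
            'aws_secret_access_key': env['AWS_SECRET_ACCESS_KEY'],
        }
    except KeyError as missing:
        # Fail early with a clear message instead of a later authentication error
        raise RuntimeError(f'Environment variable {missing} is not set') from None
```

The returned dictionary could be unpacked into the client call, for example `boto3.client("s3", region_name='us-east-1', **aws_credentials_from_env())`. boto3 also reads these two variables on its own when no keys are passed explicitly, so the helper mainly serves to fail early with a readable message.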

In [None]:
# Configure Boto3 to interface with the AWS server:
s3 = boto3.client("s3", 
                  region_name='us-east-1', 
                  aws_access_key_id=AWS_KEY_ID, 
                  aws_secret_access_key=AWS_SECRET)

In [None]:
# Read the file from the S3 bucket:
read_file = s3.get_object(Bucket=bucket,Key=key)
# print(read_file['Body'])
rows = 1000000
bd = pd.read_csv(read_file['Body'], nrows=rows)  # nrows limits the load to a sample; remove it to read the full file

## Working with compressed files

In [None]:
import os
import zipfile

# Save a single file:
jungle_zip = zipfile.ZipFile('C:\\Stories\\Fantasy\\jungle.zip', 'w')
jungle_zip.write('C:\\Stories\\Fantasy\\jungle.pdf', compress_type=zipfile.ZIP_DEFLATED)
jungle_zip.close()

# Extract a single file:
fantasy_zip = zipfile.ZipFile('C:\\Stories\\Fantasy\\archive.zip')
fantasy_zip.extract('Fantasy Jungle.pdf', 'C:\\Stories\\Fantasy')
fantasy_zip.close()

# Extract all, using a context manager so the archive is closed automatically:
with zipfile.ZipFile('foo.zip', 'r') as zf:
    zf.extractall('destination_path/')

# Extract all:
fantasy_zip = zipfile.ZipFile('C:\\Stories\\Fantasy\\archive.zip')
fantasy_zip.extractall('C:\\Library\\Stories\\Fantasy')
fantasy_zip.close()

# Write all PDFs from a folder
fantasy_zip = zipfile.ZipFile('C:\\Stories\\Fantasy\\archive.zip', 'w')
for folder, subfolders, files in os.walk('C:\\Stories\\Fantasy'):
    for file in files:
        if file.endswith('.pdf'):
            file_path = os.path.join(folder, file)
            fantasy_zip.write(file_path, os.path.relpath(file_path, 'C:\\Stories\\Fantasy'), compress_type = zipfile.ZIP_DEFLATED)

fantasy_zip.close()
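
The cell above uses Windows paths that only exist on the original author's machine. As a portable way to try the same write and extract steps, the sketch below does a full round trip inside a temporary directory. `zip_and_extract` and the sample file name are hypothetical and exist only for illustration.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_and_extract(file_name, content):
    """Write one file into a zip archive, extract it elsewhere, and return what was read back."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        source = tmp / file_name
        source.write_text(content, encoding='utf-8')

        archive = tmp / 'archive.zip'
        # The context manager closes the archive even if an error occurs
        with zipfile.ZipFile(archive, 'w', compression=zipfile.ZIP_DEFLATED) as zf:
            zf.write(source, arcname=file_name)

        out_dir = tmp / 'extracted'
        with zipfile.ZipFile(archive) as zf:
            names = zf.namelist()
            zf.extractall(out_dir)

        return names, (out_dir / file_name).read_text(encoding='utf-8')


print(zip_and_extract('sample.csv', 'a,b\n1,2\n'))
```

Passing `arcname` stores the file under its bare name instead of its full temporary path, which is the same purpose `os.path.relpath` serves in the cell above.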

In [None]:
# Read compressed json files (with Gz)
instant_video = pd.read_json("reviews_Amazon_Instant_Video.json.gz", lines=True, compression='gzip')
print('instant_video loaded')

# Using uncompressed, but cropped files:
instant_video = pd.read_csv('instvideo_crop.csv', infer_datetime_format = True, parse_dates = True)
instant_video['datetime'] = pd.to_datetime(instant_video['datetime'], format="%Y-%m-%d")
# Strip quotes, brackets and spaces, then split the remaining text into a list
instant_video['helpful'] = instant_video['helpful'].str.replace(pat = r"['\[\] ]", repl = '', regex = True).str.split(pat = ',', expand = False)


## Using Feather Format
Feather is a format designed for interoperability between Python and R. Its read and write speeds are especially good when working with Pandas categorical columns.

Feather offers faster I/O and lower memory use. However, because the format is still evolving, it is better suited to quick loading and intermediate data processing than to long-term storage.

In [148]:
!pip install feather-format
#!pip install feather-format
import feather

Collecting feather-format
  Downloading https://files.pythonhosted.org/packages/67/e8/ee99f142f19d35588501943510f8217f9dd77184574b0c933c53218e0f19/feather-format-0.4.1.tar.gz
Collecting pyarrow>=0.4.0 (from feather-format)
  Downloading https://files.pythonhosted.org/packages/3a/9b/887d1d03d3d43706dee3a71cdad9f9bbb8fe74fc93d8db5d663f5bf34e48/pyarrow-1.0.1-cp36-cp36m-manylinux1_x86_64.whl (16.6MB)
    100% |████████████████████████████████| 16.6MB 1.4MB/s
Building wheels for collected packages: feather-format
  Running setup.py bdist_wheel for feather-format ... done
  Stored in directory: /home/jovyan/.cache/pip/wheels/0b/c4/c6/2b568ff878182aab814bcfd3db31fdd0055a4b2a249a7921eb
Successfully built feather-format
Installing collected packages: pyarrow, feather-format
Successfully installed feather-format-0.4.1 pyarrow-1.0.1


In [None]:
FilePath = "./filename.ftr" # or 'my_data.feather'

# Using only pandas (df is the DataFrame to save, for example bd):
# Write: you can pass additional keywords such as compression, compression_level, chunksize
df.to_feather(FilePath)

# Read: any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, and file.
# columns specifies a sequence of columns to read (None reads all); use_threads parallelizes the read
readFrame = pd.read_feather(FilePath, columns=None, use_threads=True)
print(readFrame)

# Using the feather package:
# Write alternative:
feather.write_dataframe(df, FilePath)
# Read alternative:
df = feather.read_dataframe(FilePath)

In [None]:
import feather
FilePath = "./placetopayDB2.ftr"
#feather.write_dataframe(bd, FilePath)
bd.to_feather(FilePath)

In [None]:
# To read feather-format:
bd = pd.read_feather("./placetopayDB4.ftr", columns=None, use_threads=True)

## Using Pickles

Pickle is a file format (and a module in the Python standard library, which pandas wraps with `to_pickle` and `read_pickle`) that stores many kinds of objects, but not all, on disk so they can be read back later.

With this technique you can run parts of your code in different scripts, different Jupyter notebooks, or even different machines, and bring the results together after they run. More details:

- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_pickle.html
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_pickle.html
- https://docs.python.org/3/library/pickle.html


In [None]:
# To use that you can do this:
df.to_pickle('path_filename')

#And then when you need to restart your notebook you can simply read directly from pickle:
df = pd.read_pickle('path_filename')

# If the file is very large, the pickle file can be compressed by passing compression='gzip' to to_pickle and read_pickle.
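
To make the compression option concrete, here is a minimal round trip with a gzip-compressed pickle. The DataFrame contents and the file name `sample.pkl.gz` are made up for illustration, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import pandas as pd

# Made-up data standing in for the real DataFrame
df = pd.DataFrame({'transaction_id': ['A1', 'A2'], 'amount': [10.5, 20.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.pkl.gz')
    # compression is passed explicitly; the default 'infer' would also pick gzip from the '.gz' suffix
    df.to_pickle(path, compression='gzip')
    restored = pd.read_pickle(path, compression='gzip')

# True when values and dtypes survived the round trip
print(restored.equals(df))
```

Pickle files can execute arbitrary code when loaded, so `read_pickle` should only be used on files from a trusted source.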

In [None]:
bd.to_pickle('./placetopayDB2_pickle')

## How to DASK 