# **Spring2024_Simple_Data_Migration_from_RethinkDB_to_MariaDB Progress Report 1** 
April 26th

## Description: 
Migrate data from RethinkDB to MariaDB using Python scripts where data stored in a RethinkDB database is transferred to a MariaDB database. Develop Python scripts to retrieve data from RethinkDB and insert it into MariaDB, ensuring data integrity and consistency. Utilize the rethinkdb and mysql-connector-python libraries for database interaction, showcasing basic data migration techniques between different database systems. Add different functionalities to enhance complexity of project.

Link: https://docs.google.com/document/d/1GEOmfpBUXiCua18wR1Hx1OMUVlku-1of/edit#heading=h.4ph7eurcozb1

## About This Project:
Everything below is my personal understanding. Please feel free to correct me if you see any mistakes. Any comment is appreciated, thank you!

As I understand it, this project consists of the following steps:
##### 1. Pick the dataset
##### 2. Store the chosen dataset in a RethinkDB database
##### 3. Pull the dataset from the RethinkDB database
##### 4. Store the pulled dataset in a MariaDB database
##### 5. Pull the dataset from the MariaDB database
##### 6. Compare the chosen dataset with the pulled dataset to check the accuracy of the migration

I will have a section for each step to explain what I did and how I did it.
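
As a rough illustration of how the six steps fit together, here is a minimal sketch that uses plain in-memory dictionaries as stand-ins for both databases. That substitution is an assumption made only so the example runs anywhere; the real project uses RethinkDB and MariaDB. The helper names `fingerprint` and `migrate` and the tiny dataset are hypothetical. The part worth noting is the final check: hash the data before and after the round trip and compare the two digests.

```python
import hashlib
import json


def fingerprint(rows):
    """Return a SHA-256 digest of the rows in a canonical JSON form."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def migrate(source_store, target_store, table):
    """Copy every row of `table` from one stand-in store to the other."""
    target_store[table] = [dict(row) for row in source_store[table]]


# Step 1: pick a (tiny, made-up) dataset: one 2 x 2 frame, one document per pixel row.
dataset = [
    {"row": 0, "red": [255, 0], "green": [0, 255], "blue": [0, 0]},
    {"row": 1, "red": [10, 20], "green": [30, 40], "blue": [50, 60]},
]

# Steps 2-3: store in the stand-in for RethinkDB, then read it back.
fake_rethinkdb = {"video": dataset}
# Step 4: write into the stand-in for MariaDB.
fake_mariadb = {}
migrate(fake_rethinkdb, fake_mariadb, "video")
# Step 5: read back from the stand-in for MariaDB.
pulled = fake_mariadb["video"]

# Step 6: compare fingerprints of the original and the migrated data.
migration_ok = fingerprint(dataset) == fingerprint(pulled)
print("migration verified:", migration_ok)
```

Comparing digests avoids holding two full copies of a large dataset side by side, at the cost of only telling you that something differs, not where.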

In [1]:
from moviepy.editor import VideoFileClip
import cv2
import numpy as np
import time
from scipy.io import savemat
import rethinkdb as r
import json
import pandas as pd
import mariadb
import sys


def split(array):
    """Split an H x W x 3 image into three nested lists of Python ints.

    The channels are returned in index order (0, 1, 2). OpenCV loads
    images as BGR, so for a cv2 frame the first list is actually blue.
    """
    array = np.asarray(array)
    # astype(int).tolist() yields plain Python ints, which are JSON-serializable.
    nr = array[:, :, 0].astype(int).tolist()
    ng = array[:, :, 1].astype(int).tolist()
    nb = array[:, :, 2].astype(int).tolist()
    return nr, ng, nb


## 1. Pick the dataset
I decided to choose a rather challenging dataset: an entire video. To be specific, I am trying to migrate "Never Gonna Give You Up" by the great Rick Astley. Because of the sheer amount of data a video contains, I went with the 360p version to minimize the amount of data I have to work with. All of my work so far is limited to the video track; I have not had a chance to do anything with the audio yet, although it is part of my plan. With the help of cv2 (opencv-python), I was able to split the video into 5301 frames, each containing 230400 pixels (360 by 640). Each pixel is represented as a set of three integers on the RGB color scale (note that OpenCV returns them in BGR order).
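
To put "the sheer amount of data" into numbers, the figures quoted above can be multiplied out. This is a back-of-the-envelope sketch; the one-byte-per-channel-value figure is an assumption (8-bit color, no compression).

```python
# Frame geometry and frame count as stated in the text above.
HEIGHT, WIDTH, CHANNELS = 360, 640, 3
FRAMES = 5301

pixels_per_frame = HEIGHT * WIDTH                # 230400
values_per_frame = pixels_per_frame * CHANNELS   # 691200
total_values = values_per_frame * FRAMES         # 3664051200

# Assumption: one byte per channel value (8-bit color), no compression.
raw_gigabytes = total_values / 1e9

print(f"{pixels_per_frame} pixels per frame")
print(f"{values_per_frame} channel values per frame")
print(f"{total_values} channel values in total (~{raw_gigabytes:.2f} GB raw)")
```

Stored as JSON integers rather than raw bytes, the footprint would be larger still, since each value takes one to three digits plus a separator.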

In [2]:
cap = cv2.VideoCapture('./Rickroll.mp4')
length = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
print("there are", length, "frames in the chosen video")

there are 0 frames in the chosen video
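
A frame count of 0, as in the output above, often means that OpenCV could not open the file (for example because of a wrong path or an unsupported codec) rather than that the video is empty. The sketch below shows one way to fail early in that case. `isOpened()` and `CAP_PROP_FRAME_COUNT` are real OpenCV names; the helper `checked_frame_count` is hypothetical, and it takes the property id as an argument only so that it can be exercised without OpenCV installed.

```python
def checked_frame_count(cap, frame_count_prop):
    """Return the frame count of an opened capture, or raise if it is unusable.

    `cap` is expected to behave like cv2.VideoCapture, and
    `frame_count_prop` is expected to be cv2.CAP_PROP_FRAME_COUNT.
    """
    if not cap.isOpened():
        raise IOError("video could not be opened; check the path and codec support")
    count = int(cap.get(frame_count_prop))
    if count <= 0:
        raise ValueError("video opened but reports no frames")
    return count


# Intended use with OpenCV (not executed here):
#   import cv2
#   cap = cv2.VideoCapture('./Rickroll.mp4')
#   length = checked_frame_count(cap, cv2.CAP_PROP_FRAME_COUNT)
```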


## 2. Store the chosen dataset in a RethinkDB Database
Here I connect this notebook to a RethinkDB database running locally. I think the ultimate goal is to access the data from a remote database, but I am still working out the kinks of that, so for now I will be working with a locally hosted RethinkDB database.

In [3]:
rethink = r.RethinkDB()
rethink.connect('rethinkdb', 28015).repl()

<rethinkdb.net.DefaultConnection at 0x7f7368e2e760>

As you might have noticed, I have an image of Earth named earth.jpg among my code files. That is because, at the beginning, I wasn't able to store the entire video in the RethinkDB database, so I admitted defeat and was going to migrate a picture instead of a video. Later (around the first in-class working session) I found a way to store the video in a RethinkDB database, so the following code, where I store the picture, is no longer needed.

In [4]:
try:
    # connection parameters
    conn_params = {
        'user' : "root",
        'password' : "<196900>",
        'host' : "172.19.0.1",
        'port' : 3306,
        'database' : "<data>"
    }

    # establish a connection
    connection = mariadb.connect(**conn_params)
    cursor = connection.cursor()
    
except mariadb.Error as e:
    print(f"Error connecting to MariaDB Platform: {e}")
    sys.exit(1)
    
print(cursor)

<mariadb.cursor at 0x7f72ed8a9040>
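
Hard-coding credentials in a notebook makes it easy to leak them when the notebook is shared. One common alternative, sketched here, is to read the connection parameters from environment variables. The variable names (`MARIADB_USER` and so on) and the helper `mariadb_params_from_env` are assumptions for illustration, not something the `mariadb` package defines. The returned dictionary has the same keys as `conn_params` above.

```python
import os

# Hypothetical variable names; pick whatever suits your setup.
REQUIRED_VARS = ("MARIADB_USER", "MARIADB_PASSWORD", "MARIADB_DATABASE")


def mariadb_params_from_env(env=os.environ):
    """Build a conn_params-style dict from environment variables."""
    missing = [name for name in REQUIRED_VARS if name not in env]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {
        "user": env["MARIADB_USER"],
        "password": env["MARIADB_PASSWORD"],
        # Host and port fall back to a local server on the default MariaDB port.
        "host": env.get("MARIADB_HOST", "127.0.0.1"),
        "port": int(env.get("MARIADB_PORT", "3306")),
        "database": env["MARIADB_DATABASE"],
    }


# Intended use (not executed here):
#   connection = mariadb.connect(**mariadb_params_from_env())
```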


In [14]:
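new = []
im = cv2.imread("Small_Earth.jpg")
print(type(im))
# Note: cv2.imread returns channels in BGR order, so index 0 is really the
# blue channel and index 2 the red one. The names below keep the original labels.
red = im[:, :, 0]
green = im[:, :, 1]
blue = im[:, :, 2]
print(type(red))
# Flatten each channel into 2 rows so it can be stored as nested lists.
# Caution: `r` now shadows the `rethinkdb` module alias imported in the first cell.
r = red.reshape(2, -1).tolist()
g = green.reshape(2, -1).tolist()
b = blue.reshape(2, -1).tolist()
print(len(r[1]))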

<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
84050


In [16]:
#rethink.db_create("film").run()
#rethink.db("film").table_create("video").run()
#rethink.db("film").table_create("audio").run()
rethink.db_create("earth").run()
rethink.db("earth").table_create("pic").run()

{'config_changes': [{'new_val': {'db': 'earth',
    'durability': 'hard',
    'id': '42366f95-f3ff-4213-8afe-52b9141a239d',
    'indexes': [],
    'name': 'pic',
    'primary_key': 'id',
    'shards': [{'nonvoting_replicas': [],
      'primary_replica': '43b3a94a615c_3zh',
      'replicas': ['43b3a94a615c_3zh']}],
    'write_acks': 'majority',
    'write_hook': None},
   'old_val': None}],
 'tables_created': 1}

In [18]:
data = {"red":r, "green":g, "blue":b}
rethink.db("earth").table("pic").insert(data).run()

{'deleted': 0,
 'errors': 0,
 'generated_keys': ['c9079786-1e2c-4d99-9fc8-376fcd1f62ec'],
 'inserted': 1,
 'replaced': 0,
 'skipped': 0,
 'unchanged': 0}

In [25]:
dt = []
tr = rethink.db('earth').table('pic').run()
for doc in tr:
    dt.append(doc)
data = dt[0]
print(type(data))

<class 'dict'>


The following code stores the entire video in a RethinkDB database. Each frame is stored in a separate table labeled with the frame number. Since each frame is 360 by 640 pixels, there are 360 rows in each table. Each row has three fields (red, green, and blue), each holding a list of 640 values. This is functional at best, and I am currently working on making the process more efficient.

In [None]:
'''for i in range(length):
    new = []
    cap.set(cv2.CAP_PROP_POS_FRAMES, i)
    ret, frame = cap.read()
    frame = np.array(frame)
    red, green, blue = split(frame)
    name = "frame_"+str(i)
    rethink.db('Project').table_create(name).run()
    for j in range(len(red)):
        new.append({"red":red[j], "green":green[j], "blue":blue[j]})
    rethink.db('Project').table(name).insert(new).run()
    print(i/length)'''
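
One possible direction for the efficiency work mentioned above, offered as a sketch and not as the method used in this project: keep all frames in a single table, tag each document with its frame and row number, and insert in batches, so that no `table_create` call per frame is needed. The helper names `frame_documents` and `batched`, the field names `frame` and `row`, and the table name `video` in the usage comment are all assumptions. Frames are taken as nested lists so the example runs without OpenCV.

```python
from itertools import islice


def frame_documents(frame_index, frame):
    """Yield one document per pixel row of a frame.

    `frame` is a nested list shaped [height][width][3], with channels
    assumed to be in (red, green, blue) order.
    """
    for row_index, row in enumerate(frame):
        yield {
            "frame": frame_index,
            "row": row_index,
            "red": [pixel[0] for pixel in row],
            "green": [pixel[1] for pixel in row],
            "blue": [pixel[2] for pixel in row],
        }


def batched(iterable, size):
    """Yield lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# A made-up 2 x 2 frame for demonstration.
tiny_frame = [
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [10, 20, 30]],
]
docs = list(frame_documents(0, tiny_frame))
batches = list(batched(docs, 1))

# Intended use with OpenCV and the RethinkDB driver (not executed here):
#   rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB).tolist()
#   for batch in batched(frame_documents(i, rgb), 200):
#       rethink.db('Project').table('video').insert(batch).run()
```

With this layout a single frame can still be retrieved by filtering on the `frame` field, and the batch size (200 in the comment, an arbitrary choice) can be tuned to trade memory use against the number of round trips.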

Everything from here on is my original code, as you can see in the code.py file. It is included purely for demonstration purposes.

In [None]:
###Available Code###
#Everything above has been tested and will work

#rethink.db("test").table_create("array").run()
#print(frame.ndim)
#print(length)
#new = frames
'''jsondata = json.dumps(frames)
with open("/home/ke/Desktop/database/testing/Data605Project/Python Codes/test2.json", "w") as json_file:
    json_file.write(jsondata)'''
# Serialize the list to JSON
#rethink.db("ke-B660M-AORUS-PRO-AX-DDR4").table
#rethink.connect('localhost', 28015).repl()
#print(loaded_arrays[1].ndim)
# Write the array to disk
#print(frames[12].ndim)
#newarray = np.stack(frames)
#print(newarray.shape)

###Line of Submission###

'''
red = red.tolist()
green = im[:, :, 1:2]
green = green.tolist()
blue = im[:, :, 2:3]
blue = blue.tolist()
nb = []
for k in blue:
    nb.append([item for sublist in k for item in sublist])'''
#rethink.db('Project').table_create('picture').run()
'''for i in range(len(red)):
    frames.append({"red":red[i], "green":green[i], "blue":blue[i]})
    #print(i)
print(len(red[2]))
#rethink.db('Project').table_create('NewArray').run()
#rethink.db('Project').table('Array').insert(frames).run()
array = rethink.db('Project').table('Array').pluck('red').run()
for document in array:
    new.append(document)
x = list(new[4].values())
print(x[0][4][0])'''

**I understand my progress so far is slightly behind where it should be, but there has been an incident. My old PC (a 2015 MacBook Air with a 4-core CPU) was not up to the task: just running the script made the keyboard hot and the machine itself too hot to work with. So I decided to get a new one. The shipping time, along with setting up the system, took almost the entire week. My system is up to date now and I will catch up very soon.**

## 3. Pull dataset from RethinkDB Database

## 4. Store the pulled dataset in a MariaDB Database

## 5. Pull dataset from MariaDB Database

## 6. Compare the chosen dataset with the pulled dataset for accuracy of migration
