# Avro to PostgreSQL

## Generating a fully-functional CSV. (Err... repairing)

When my database processing script finished and saved its output as a CSV, something went wrong and the file was corrupted.

As a result, it contained only 44,000 rows.

Luckily, I had also saved a copy as an .avro file.

Here are the steps I took to make this work:
* Pulled the .avro file into pandas.
* Traversed the DataFrame cell by cell to make sure every value was encoded correctly.
* Exported the DataFrame to a CSV.
* Used my terminal and psql to send the CSV to our AWS RDS PostgreSQL instance.

Here are the terminal commands I used to upload the CSV:

``` sh
foo@bar:~$ brew install postgres

foo@bar:~$ sudo mkdir -p /etc/paths.d && echo /Applications/Postgres.app/Contents/Versions/latest/bin | sudo tee /etc/paths.d/postgresapp

foo@bar:~$ psql "host=awsinstancename.awslocation.rds.amazonaws.com port=5432 dbname=lambdaRPG user=lambdaRPG"

lambdaRPG=> \copy commentor_data (commentor, commentor_sentiment, commentor_total_happyness, commentor_total_saltiness, commentor_upvotes_mean, commentor_upvotes_total, qty_non_salty_comments, qty_salty_comments, salty_comments, sweet_comments, total_comments) from 'commentor_data.csv' CSV HEADER;
COPY 183926
```
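
The same upload could also be scripted from Python with psycopg2, which this notebook imports but never uses. The following is a minimal sketch, not what was actually run here: `build_copy_sql` and `copy_csv_to_table` are hypothetical helper names, and the table and column names are assumed to be trusted because they are interpolated straight into the SQL. It relies on psycopg2's `cursor.copy_expert`, which streams a local file through `COPY ... FROM STDIN`, much like `\copy` does from psql.

```python
def build_copy_sql(table, columns):
    """Build a COPY ... FROM STDIN statement for a CSV file that has a header row."""
    # Assumption: table and column names are trusted identifiers (no quoting is applied).
    column_list = ", ".join(columns)
    return f"COPY {table} ({column_list}) FROM STDIN WITH (FORMAT csv, HEADER true)"


def copy_csv_to_table(dsn, csv_path, table, columns):
    """Stream a local CSV file into an existing PostgreSQL table."""
    import psycopg2  # imported lazily so build_copy_sql works without a driver installed

    sql = build_copy_sql(table, columns)
    conn = psycopg2.connect(dsn)
    try:
        # `with conn` commits on success and rolls back if the copy raises.
        with conn, conn.cursor() as cur, open(csv_path, encoding="utf-8") as f:
            cur.copy_expert(sql, f)
    finally:
        conn.close()
```

The `dsn` argument would be the same connection string passed to `psql` above, plus a password supplied however your setup prefers (for example through a `~/.pgpass` file).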

From now on I will likely export my important DataFrame outputs to more than one file type as a precaution (at least in my personal projects).
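
One way to make that habit cheap is a small helper that writes the same DataFrame to several formats and then re-reads each one to confirm the row count. This is only a sketch: `export_redundant` and `verify_exports` are hypothetical names, and CSV plus JSON Lines were picked simply because pandas writes both without extra dependencies.

```python
from pathlib import Path

import pandas as pd


def export_redundant(df, stem):
    """Write df as CSV and JSON Lines side by side; return the paths by format."""
    stem = Path(stem)
    stem.parent.mkdir(parents=True, exist_ok=True)
    paths = {
        "csv": stem.with_suffix(".csv"),
        "jsonl": stem.with_suffix(".jsonl"),
    }
    df.to_csv(paths["csv"], index=False)
    df.to_json(paths["jsonl"], orient="records", lines=True)
    return paths


def verify_exports(df, paths):
    """Re-read every export and report whether its row count matches df."""
    readers = {
        "csv": pd.read_csv,
        "jsonl": lambda p: pd.read_json(p, lines=True),
    }
    return {fmt: len(readers[fmt](path)) == len(df) for fmt, path in paths.items()}
```

Had a check like `verify_exports` run right after the original export, a file holding 44,000 rows instead of 183,926 would have been flagged immediately.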

### Import and install packages

In [1]:
#!pip install pandavro
import pandas as pd
import sqlite3
import psycopg2
import pandavro as pdx
import sqlalchemy
from sqlalchemy import create_engine
from sqlite3 import dbapi2 as sqlite
from tqdm import tqdm, tqdm_pandas

### Import and check dataframe shape

In [2]:
df = pdx.read_avro('data/hn_commentors_db.avro')
df.shape

(183926, 11)

In [3]:
df.tail()

| | commentor | commentor_sentiment | commentor_total_happyness | commentor_total_saltiness | commentor_upvotes_mean | commentor_upvotes_total | qty_non_salty_comments | qty_salty_comments | salty_comments | sweet_comments | total_comments |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 183921 | anonn | 0.097743 | 0.097743 | 0.0 | 0.0 | 0 | 1 | 0 | `[{"time":1243903124,"comment_sentiment":0.0977...` | `[{"time":1243903124,"comment_sentiment":0.0977...` | 1 |
| 183922 | tikl1 | -0.166667 | 0.0 | -0.166667 | 3.0 | 3 | 0 | 1 | `[{"time":1427198521,"comment_sentiment":-0.166...` | `[{"time":1427198521,"comment_sentiment":-0.166...` | 1 |
| 183923 | autismjohndoe | 0.195139 | 0.195139 | 0.0 | 3.0 | 3 | 1 | 0 | `[{"time":1444150317,"comment_sentiment":0.1951...` | `[{"time":1444150317,"comment_sentiment":0.1951...` | 1 |
| 183924 | alexf4v2 | 0.170455 | 0.170455 | 0.0 | 5.0 | 5 | 1 | 0 | `[{"time":1309461809,"comment_sentiment":0.1704...` | `[{"time":1309461809,"comment_sentiment":0.1704...` | 1 |
| 183925 | ilikedata | 0.39375 | 0.39375 | 0.0 | 0.0 | 0 | 1 | 0 | `[{"time":1349098841,"comment_sentiment":0.3937...` | `[{"time":1349098841,"comment_sentiment":0.3937...` | 1 |


### Make sure it isn't an encoding issue. Check each cell individually.

In [4]:
# Convert every non-string cell to a UTF-8-safe string and blank out any cell that fails.
# (get_value/set_value were removed in pandas 1.0; .at is the replacement.)
for column in df.columns:
    # Cast to object first so strings can be written into numeric columns.
    df[column] = df[column].astype(object)
    for idx in df.index:
        x = df.at[idx, column]
        try:
            x = x if isinstance(x, str) else str(x).encode('utf-8', 'ignore').decode('utf-8', 'ignore')
            df.at[idx, column] = x
        except Exception:
            print('encoding error: {0} {1}'.format(idx, column))
            df.at[idx, column] = ''
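
To see what the encode/decode round trip actually does, here is a small standalone sketch. Note that the loop above leaves values that are already strings untouched, so it only ever round-trips stringified non-strings; the hypothetical helper `utf8_safe` below applies the round trip to strings as well, which is an assumption about what a stricter check might look like rather than what the notebook did.

```python
def utf8_safe(value):
    """Return value as a string that is guaranteed to encode as UTF-8."""
    text = value if isinstance(value, str) else str(value)
    # With errors='ignore', characters that cannot be encoded
    # (for example lone surrogates such as '\udc80') are silently dropped.
    return text.encode("utf-8", "ignore").decode("utf-8", "ignore")


cleaned = utf8_safe("salty\udc80")   # the lone surrogate is removed
unchanged = utf8_safe("na\u00efve")  # valid text passes through as-is
stringified = utf8_safe(0.097743)    # non-strings are stringified first
```

Applied to a whole column, this could replace the inner loop with something like `df[column].map(utf8_safe)`.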



### Export the clean, intact data back to CSV.

In [5]:
df.to_csv('data/commentor_data_repaired.csv',index=False)
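
Since the original problem was a silently truncated CSV, a quick record count straight after the export is a cheap sanity check. The sketch below uses only the standard library; `count_csv_records` is a hypothetical helper name. It parses the file with the `csv` module rather than counting lines, so quoted cells that contain newlines are not miscounted.

```python
import csv


def count_csv_records(path):
    """Count the data records in a CSV file, excluding the header row."""
    # Cells holding long JSON strings can exceed the csv module's default field size limit.
    csv.field_size_limit(2**31 - 1)
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)                       # skip the header row
        return sum(1 for row in reader if row)   # ignore completely blank lines
```

For the export above, `count_csv_records('data/commentor_data_repaired.csv')` would be expected to equal `len(df)`.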

### Reimport the data from CSV to DataFrame and inspect.

In [6]:
df2 = pd.read_csv("data/commentor_data_repaired.csv")
df2.tail()

| | commentor | commentor_sentiment | commentor_total_happyness | commentor_total_saltiness | commentor_upvotes_mean | commentor_upvotes_total | qty_non_salty_comments | qty_salty_comments | salty_comments | sweet_comments | total_comments |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 183921 | anonn | 0.097743 | 0.097743 | 0.0 | 0.0 | 0 | 1 | 0 | `[{"time":1243903124,"comment_sentiment":0.0977...` | `[{"time":1243903124,"comment_sentiment":0.0977...` | 1 |
| 183922 | tikl1 | -0.166667 | 0.0 | -0.166667 | 3.0 | 3 | 0 | 1 | `[{"time":1427198521,"comment_sentiment":-0.166...` | `[{"time":1427198521,"comment_sentiment":-0.166...` | 1 |
| 183923 | autismjohndoe | 0.195139 | 0.195139 | 0.0 | 3.0 | 3 | 1 | 0 | `[{"time":1444150317,"comment_sentiment":0.1951...` | `[{"time":1444150317,"comment_sentiment":0.1951...` | 1 |
| 183924 | alexf4v2 | 0.170455 | 0.170455 | 0.0 | 5.0 | 5 | 1 | 0 | `[{"time":1309461809,"comment_sentiment":0.1704...` | `[{"time":1309461809,"comment_sentiment":0.1704...` | 1 |
| 183925 | ilikedata | 0.39375 | 0.39375 | 0.0 | 0.0 | 0 | 1 | 0 | `[{"time":1349098841,"comment_sentiment":0.3937...` | `[{"time":1349098841,"comment_sentiment":0.3937...` | 1 |


In [7]:
df2.shape

(183926, 11)

# AND IT WORKED! 