# Data Test

---
#### **Part I: File Parsing**
---

 - **Question 1**

> The accompanying file, data_test.gz, is a compressed file containing 2,000,000 rows of data, with the following fields:
> - id
> - username
> - posted_datetime
> - comments
>
> To import this file into our database, we need the file to be tab-delimited. Unfortunately, the tech person at the client site pulled the data and used the ESC character as the delimiter. We need to clean up this file and replace the ESC delimiter with a tab.
> You can use whatever language or tool you want, but show how you’d create the new, cleaned file.

 - **Question 2**

> Turns out tab-delimited isn’t going to work out either. We need the file to be in CSV format. The trick here is that the comments column in the file has data containing commas. So we need to be sure to escape things properly. You can use whatever language or tool you want, but show how you’d create the new, cleaned file.

---

In [1]:
ls

Data_test_2018_08_01.ipynb       data_test_2m.esc
Data_test_2018_08_01.ipynb copy  output.csv
comma_delim_clean.csv            tab_delim_clean.csv
comma_delim_trunc.csv


#### Load imports: pandas for data manipulation and the standard-library `csv` module for read/write capabilities.

In [3]:
import pandas as pd
import csv
pd.set_option('display.max_colwidth', None) # do not truncate the comments column, need to see it in full for cleaning (older pandas releases used -1 here)

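#### Read the data file into pandas. Note that `'\3'` is the octal escape for byte 0x03, not ESC (0x1B, written `'\33'` or `'\x1b'`), and `'\33'` did not split the file either. Splitting on `'\3'` works but leaves a stray '3' at the start of the following fields; will need to get to the bottom of this, but it is easy to clean for now.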

In [4]:
df = pd.read_csv('data_test_2m.esc', sep='\3')
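
A quick check of how Python reads these backslash escapes may help explain the leading `3` in the column names below. This is only a sketch, independent of the data file. The `sample` line is hypothetical and assumes the separator in the file is the two-character sequence 0x03 followed by a literal `3`, which is one reading consistent with the output and not something the file has been inspected to confirm.

```python
# Octal escapes take up to three digits; hex escapes take exactly two.
etx = '\3'    # one character: 0x03 (ETX)
esc = '\33'   # one character: 0x1B (ESC); octal 33 is decimal 27

print(len(etx), hex(ord(etx)))
print(len(esc), hex(ord(esc)))
print(esc == '\x1b' == chr(27))

# Hypothetical line: fields separated by 0x03 followed by a literal '3'
sample = '1\x033alice\x0332017-05-25 01:38:41\x033hello'

fields_one_char = sample.split('\3')      # leaves a '3' in front of each later field
fields_two_char = sample.split('\x033')   # splits on the full two-character sequence
print(fields_one_char)
print(fields_two_char)
```

If that assumption holds, passing the full two-character separator to the parser would avoid the clean-up step, at the cost of using a multi-character separator.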

#### Check the number of rows/columns

In [5]:
print('Rows: ', df.shape[0], '\nColumns: ', df.shape[1])

Rows:  2000000 
Columns:  4


#### Check data types and column names

In [6]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000000 entries, 0 to 1999999
Data columns (total 4 columns):
id                  int64
3username           object
3posted_datetime    object
3comments           object
dtypes: int64(1), object(3)
memory usage: 584.5 MB


#### Rename columns

In [7]:
df.columns = ['id', 'username', 'posted_datetime', 'comments']

In [None]:
# TODO: Would like to verify all cells begin with a '3' before editing directly
# assert df['username'].str.startswith('3').all()

#### Drop the '3' left over from the delimiter at the start of every cell in the columns 'username', 'posted_datetime', and 'comments'.
 - For every cell in the column, keep everything from the second character to the end of the cell

In [8]:
df.loc[:, 'username'] = df.loc[:, 'username'].str[1:] 

In [9]:
df.loc[:, 'posted_datetime'] = df.loc[:, 'posted_datetime'].str[1:] 

In [10]:
df.loc[:, 'comments'] = df.loc[:, 'comments'].str[1:] 
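
The three cells above slice off the first character unconditionally. A slightly more defensive variant, sketched here on a toy frame, checks that every cell really starts with the expected prefix before removing it. The helper name `strip_prefix` and the `demo` data are made up for illustration.

```python
import pandas as pd

def strip_prefix(series: pd.Series, prefix: str = '3') -> pd.Series:
    """Remove `prefix` from every cell, failing loudly if any cell lacks it."""
    has_prefix = series.str.startswith(prefix, na=False)
    if not has_prefix.all():
        bad = int((~has_prefix).sum())
        raise ValueError(f'{bad} cell(s) in {series.name!r} do not start with {prefix!r}')
    return series.str[len(prefix):]

# Toy frame standing in for the real data (hypothetical values)
demo = pd.DataFrame({
    'id': [1, 2],
    'username': ['3alice', '3bob'],
    'posted_datetime': ['32017-05-25 01:38:41', '32010-09-01 07:01:24'],
    'comments': ['3hello', '33 cheers'],
})

for col in ['username', 'posted_datetime', 'comments']:
    demo[col] = strip_prefix(demo[col])

print(demo)
```

Only the first `3` is removed, so a comment that genuinely begins with a digit 3 keeps it.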

#### Check for nulls

In [11]:
df.isnull().sum()

id                 0
username           0
posted_datetime    0
comments           0
dtype: int64

#### Drop commas and backslashes from comments column for SQL

In [12]:
df['comments'] = df['comments'].str.replace(',', '', regex=False)

In [13]:
df['comments'] = df['comments'].str.replace('\\', '', regex=False) # regex=False: treat the backslash as a literal character

#### Replace the index pandas created by default with the "id" column. It will be used as the primary key in SQL and must be unique, so check that first.

In [16]:
df.head(1)

|  | id | username | posted_datetime | comments |
|---|---|---|---|---|
| 0 | 14414747 | wichcraft | 2017-05-25 01:38:41 | i am curious where you get the quote from for each book. |


In [17]:
df.loc[:, 'id'].is_unique

True

In [18]:
df.set_index('id', inplace=True)

#### Have another look, should be clean

In [19]:
df.head()

| id | username | posted_datetime | comments |
|---|---|---|---|
| 14414747 | wichcraft | 2017-05-25 01:38:41 | i am curious where you get the quote from for each book. |
| 1652097 | brandnewlow | 2010-09-01 07:01:24 | Very fair. Duly noted. |
| 2980343 | darrenkopp | 2011-09-10 00:26:41 | We're pretty good at it now. A=Republicans B=Democrats. We switch every couple of election cycles and see how it goes. |
| 5573111 | SG- | 2013-04-18 21:10:58 | OVH also has a datacenter in Montreal. |
| 17088135 | namibj | 2018-05-17 02:06:42 | Blockchain? No seriously just a block-oriented write-ahead-log replicated to the towers allowing them to cheaply-ish verify a proof-of-traffic quota. |


#### Save as clean TAB delimited file named 'tab_delim_clean.csv'

In [20]:
df.to_csv('tab_delim_clean.csv', sep='\t')
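
For a file this size, the delimiter swap can also be done as a stream, without loading all rows into memory. The sketch below uses only the standard library. The function name `esc_to_tsv` is made up, and it assumes the input is gzip-compressed UTF-8 text with one record per line and a true ESC (0x1B) delimiter; pass a different `delimiter` if the file uses something else.

```python
import csv
import gzip

ESC = '\x1b'  # assumed delimiter; swap in whatever the file really uses

def esc_to_tsv(src_path: str, dst_path: str, delimiter: str = ESC) -> int:
    """Rewrite a gzip-compressed, `delimiter`-separated file as tab-delimited text.

    Returns the number of rows written (header included).
    """
    rows = 0
    with gzip.open(src_path, 'rt', encoding='utf-8', newline='') as src, \
         open(dst_path, 'w', encoding='utf-8', newline='') as dst:
        # csv.writer quotes any field that itself contains a tab or a quote
        writer = csv.writer(dst, delimiter='\t', lineterminator='\n')
        for line in src:
            writer.writerow(line.rstrip('\r\n').split(delimiter))
            rows += 1
    return rows
```

A call such as `esc_to_tsv('data_test.gz', 'tab_delim_clean.tsv')` would write the converted file; the output name is illustrative. Records with embedded newlines would need a real parser rather than line-by-line splitting.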

#### Save as clean comma delimited file named 'comma_delim_clean.csv'

In [21]:
df.to_csv('comma_delim_clean.csv', sep=',')
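
Commas inside a field do not have to be removed to produce valid CSV: with minimal quoting, which is the default for both the `csv` module and `DataFrame.to_csv`, any field containing a comma or a quote is wrapped in double quotes and embedded quotes are doubled. A small standard-library sketch with made-up rows:

```python
import csv
import io

rows = [
    ['id', 'username', 'posted_datetime', 'comments'],
    ['1', 'alice', '2017-05-25 01:38:41', 'Hello, world'],
    ['2', 'bob', '2010-09-01 07:01:24', 'She said "fair", duly noted.'],
]

# Write to an in-memory buffer with minimal quoting
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator='\n').writerows(rows)
text = buf.getvalue()
print(text)

# Reading the text back recovers the original fields, commas and quotes included
round_trip = list(csv.reader(io.StringIO(text)))
print(round_trip == rows)
```

Whether the target database's bulk loader understands quoted fields is a separate question, so it is worth checking its import options before relying on this.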

#### Save as truncated, clean comma delimited file named 'comma_delim_trunc.csv' for testing

In [22]:
df.head(500).to_csv('comma_delim_trunc.csv', sep=',')

#### Test load 'comma_delim_clean.csv' with default separator ','

In [33]:
df2 = pd.read_csv('comma_delim_clean.csv', index_col='id')
# FutureWarning is OK:
# https://stackoverflow.com/questions/40659212/futurewarning-elementwise-comparison-failed-returning-scalar-but-in-the-futur

  mask |= (ar1 == a)


In [34]:
df2.shape

(2000000, 3)

In [35]:
df2.head()

| id | username | posted_datetime | comments |
|---|---|---|---|
| 14414747 | wichcraft | 2017-05-25 01:38:41 | i am curious where you get the quote from for each book. |
| 1652097 | brandnewlow | 2010-09-01 07:01:24 | Very fair. Duly noted. |
| 2980343 | darrenkopp | 2011-09-10 00:26:41 | We're pretty good at it now. A=Republicans B=Democrats. We switch every couple of election cycles and see how it goes. |
| 5573111 | SG- | 2013-04-18 21:10:58 | OVH also has a datacenter in Montreal. |
| 17088135 | namibj | 2018-05-17 02:06:42 | Blockchain? No seriously just a block-oriented write-ahead-log replicated to the towers allowing them to cheaply-ish verify a proof-of-traffic quota. |


---
#### **Part II: SQL**
---

 - **Question 3**

> You’ve cleaned and imported the file above into the database successfully. Great! The data is stored in a table called user_comments. Now you want to list the top-10 most prolific posters, by username. Write a query that produces this result.

In [None]:
-- SQL
-- Count every comment row: COUNT(DISTINCT comments) would undercount users who post the same text twice
SELECT username, COUNT(*) AS num_posts
FROM user_comments
GROUP BY username
ORDER BY num_posts DESC
LIMIT 10;
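
Whether to count rows or distinct comment texts changes the totals whenever a user repeats a comment. A small sketch using an in-memory SQLite database shows the difference; the rows are made up and SQLite stands in for whatever database actually holds `user_comments`.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE user_comments ('
    'id INTEGER PRIMARY KEY, username TEXT, posted_datetime TEXT, comments TEXT)'
)
conn.executemany(
    'INSERT INTO user_comments (username, posted_datetime, comments) VALUES (?, ?, ?)',
    [
        ('alice', '2017-05-25 01:38:41', 'ok'),
        ('alice', '2017-05-26 01:38:41', 'ok'),      # same text posted twice
        ('alice', '2017-05-27 01:38:41', 'thanks'),
        ('bob', '2017-05-25 02:00:00', 'first'),
        ('bob', '2017-05-25 03:00:00', 'second'),
    ],
)

# Number of comment rows per user
count_all = conn.execute(
    'SELECT username, COUNT(*) AS num_posts FROM user_comments '
    'GROUP BY username ORDER BY num_posts DESC, username LIMIT 10'
).fetchall()

# Number of distinct comment texts per user
count_distinct = conn.execute(
    'SELECT username, COUNT(DISTINCT comments) AS num_posts FROM user_comments '
    'GROUP BY username ORDER BY num_posts DESC, username LIMIT 10'
).fetchall()

print(count_all)
print(count_distinct)
```

The extra `username` sort key only makes ties deterministic in this toy example.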

 ---
 - **Question 4**

> There’s another table in your database called users that has the following columns:
> - username
> - name
> - is_vip
> - joined_datetime
>
> Write a query that updates the users table so that only the top-10 posters have a value for is_vip.

In [None]:
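-- SQL (assumes a dialect with LIMIT and temporary tables, and that is_vip is an integer flag)
CREATE TEMPORARY TABLE temptable (
    username VARCHAR(30)
    ,num_posts INT
    );

INSERT INTO temptable
SELECT username
    ,COUNT(*) AS num_posts
FROM user_comments
GROUP BY username
ORDER BY num_posts DESC
LIMIT 10;

-- Clear the flag for everyone first so that only the top 10 end up with a value
UPDATE users
SET is_vip = NULL;

UPDATE users
SET is_vip = 1
WHERE username IN (
        SELECT username
        FROM temptable
        );

DROP TABLE temptable;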
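
One way to check the update logic end to end is to run it against an in-memory SQLite database. The sketch below uses made-up users and comments and folds the steps into a single `UPDATE` with a `CASE` expression, which is an alternative formulation rather than the query above. It assumes `is_vip` is an integer flag and that NULL means "no value".

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE users (username TEXT PRIMARY KEY, name TEXT, is_vip INTEGER, joined_datetime TEXT);
    CREATE TABLE user_comments (id INTEGER PRIMARY KEY, username TEXT, posted_datetime TEXT, comments TEXT);
''')

# 12 made-up users; user_k posts k comments, and user_00 carries a stale VIP flag
users = [
    (f'user_{k:02d}', f'User {k}', 1 if k == 0 else None, '2017-01-01 00:00:00')
    for k in range(12)
]
conn.executemany('INSERT INTO users VALUES (?, ?, ?, ?)', users)
for k in range(12):
    conn.executemany(
        'INSERT INTO user_comments (username, posted_datetime, comments) VALUES (?, ?, ?)',
        [(f'user_{k:02d}', '2017-02-01 00:00:00', f'comment {i}') for i in range(k)],
    )

# One statement: flag the top 10 by comment count and clear everyone else
conn.execute('''
    UPDATE users
    SET is_vip = CASE
        WHEN username IN (
            SELECT username
            FROM user_comments
            GROUP BY username
            ORDER BY COUNT(*) DESC
            LIMIT 10
        ) THEN 1
        ELSE NULL
    END
''')

vips = [row[0] for row in conn.execute(
    'SELECT username FROM users WHERE is_vip = 1 ORDER BY username'
)]
print(vips)
```

Not every database allows `LIMIT` inside an `IN` subquery, so the temporary-table route may still be needed elsewhere.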

---
- **Question 5**

> Using both the users and user_comments tables, write a query to calculate what percentage of comments were made in the first 30 days of the user's account.

 - Get the number of comments made in the first 30 days
 - Get the total number of comments
 - Divide the two and multiply by 100

In [None]:
-- SQL (date arithmetic is dialect-specific; PostgreSQL-style INTERVAL syntax is assumed here)
-- Assumes every comment's username has a matching row in users
SELECT 100.0 * SUM(CASE
            WHEN c.posted_datetime >= u.joined_datetime
                AND c.posted_datetime < u.joined_datetime + INTERVAL '30 days'
                THEN 1
            ELSE 0
            END) / COUNT(*) AS pct_first_30_days
FROM user_comments c
INNER JOIN users u ON u.username = c.username;
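
The three steps listed for this question can be tried out on a toy dataset with an in-memory SQLite database. This is only a sketch under stated assumptions: the rows are made up, timestamps are stored as ISO-8601 text, and SQLite's `julianday` function is used for the 30-day window, so the date arithmetic would need adapting for another database.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE users (username TEXT PRIMARY KEY, name TEXT, is_vip INTEGER, joined_datetime TEXT);
    CREATE TABLE user_comments (id INTEGER PRIMARY KEY, username TEXT, posted_datetime TEXT, comments TEXT);
    INSERT INTO users VALUES ('alice', 'Alice', NULL, '2017-01-01 00:00:00');
    INSERT INTO users VALUES ('bob', 'Bob', NULL, '2017-03-01 00:00:00');
    INSERT INTO user_comments (username, posted_datetime, comments) VALUES
        ('alice', '2017-01-10 12:00:00', 'within 30 days'),
        ('alice', '2017-02-15 12:00:00', 'after 30 days'),
        ('bob',   '2017-03-02 08:00:00', 'within 30 days'),
        ('bob',   '2017-06-01 08:00:00', 'after 30 days');
''')

# Share of all comments posted within 30 days of the author's join date
pct = conn.execute('''
    SELECT 100.0 * SUM(
               CASE WHEN julianday(c.posted_datetime) - julianday(u.joined_datetime)
                         BETWEEN 0 AND 30
                    THEN 1 ELSE 0 END
           ) / COUNT(*)
    FROM user_comments AS c
    JOIN users AS u ON u.username = c.username
''').fetchone()[0]
print(pct)
```

Two of the four toy comments fall inside the window, so the query returns 50.0 here.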