# Data Test

---
 - Data: https://drive.google.com/drive/folders/1nbabX6sSNQ4fvMYWDpDX511qUhFJJWCL

---

#### **Part I: File Parsing**


 - **Question 1**

> The accompanying file, data_test.gz, is a compressed file containing 2,000,000 rows of data, with the following fields:
> - id
> - username
> - posted_datetime
> - comments
>
> To import this file into our database, we need the file to be tab-delimited. Unfortunately, the tech person at the client site pulled the data and used the ESC character as the delimiter. We need to clean up this file and replace the ESC delimiter with a tab.
>
> You can use whatever language or tool you want, but show how you’d create the new, cleaned file.

 - **Question 2**

> Turns out tab-delimited isn’t going to work out either. We need the file to be in CSV format. The trick here is that the comments column in the file has data containing commas. So we need to be sure to escape things properly.
>
> You can use whatever language or tool you want, but show how you’d create the new, cleaned file.

---
#### **Part II: SQL**

 - **Question 3**

> You’ve cleaned and imported the file above into the database successfully. Great! The data is stored in a table called user_comments. Now you want to list the top-10 most prolific posters, by username. Write a query that produces this result.


 - **Question 4**

> There’s another table in your database called users that has the following columns:
> - username
> - name
> - is_vip
> - joined_datetime
>
> Write a query that updates the users table so that only the top-10 posters have a value for is_vip.

 - **Question 5**

> Using both the users and user_comments tables, write a query to calculate what percentage of comments were made in the first 30 days of the user's account.

In [39]:
ls

Data_test_2018_08_01.ipynb  data_test_2m.esc copy
comma_delim_clean.csv       output.csv
data_test_2m.esc            tab_delim_file.csv


#### Load imports: Pandas for data manipulation and the standard library `csv` module for read/write capabilities

In [40]:
import pandas as pd
import csv

#### Read the data file row by row, replace the delimiters (from ESC to TAB), then save as a new file (output.csv)

In [41]:
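# '\3' is Python's octal escape for the byte 0x03 (ESC itself would be '\x1b')
with open("data_test_2m.esc", "r", newline="") as src, open("output.csv", "w", newline="") as dst:
    reader = csv.reader(src, delimiter='\3')
    writer = csv.writer(dst, delimiter='\t')
    writer.writerows(reader)  # stream rows straight from the reader to the writer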
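
The same delimiter swap can also be done as a plain streaming pass over the compressed file, without parsing fields at all. This is a minimal sketch, not the notebook's method: the helper name `replace_delimiter` is hypothetical, and the default `old="\x1b"` assumes the delimiter really is the ESC byte named in the question (the cell above splits on `'\3'` instead, so the value may need adjusting for the actual file).

```python
import gzip


def replace_delimiter(src_path, dst_path, old="\x1b", new="\t"):
    """Copy a gzip-compressed text file to dst_path, swapping one delimiter for another.

    Hypothetical helper. Assumes no field already contains `new`; a tab inside
    a comment would otherwise show up as an extra column.
    Returns the number of lines written.
    """
    count = 0
    with gzip.open(src_path, "rt", encoding="utf-8", newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            # Only one line is held in memory at a time.
            dst.write(line.replace(old, new))
            count += 1
    return count
```

A call such as `replace_delimiter("data_test.gz", "tab_delim_file.tsv")` would then produce the tab-delimited copy; the output file name here is only an example.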

#### Load into Pandas the new TAB delimited file to inspect

In [42]:
df = pd.read_csv('output.csv', sep='\t')

#### Check the number of rows/columns

In [43]:
print('Rows: ', df.shape[0], '\nColumns: ', df.shape[1])

Rows:  1985554 
Columns:  4
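
The count above, 1,985,554, is lower than the 2,000,000 rows the question describes. One plausible cause, which this notebook does not verify, is quote handling: by default `csv.reader` treats a double quote at the start of a field as opening a quoted field and keeps reading across line breaks until the closing quote, so comments that contain quotes can merge several lines into one row. The sketch below reproduces that effect on made-up sample text (the sample and the `count_rows` helper are illustrative, not taken from the file) and shows that `quoting=csv.QUOTE_NONE` keeps one row per line.

```python
import csv
import io

# Made-up sample: two lines, each comment holding one unbalanced double quote.
SAMPLE = '1\x1b"unbalanced quote\n2\x1bsecond row"\n'


def count_rows(text, **reader_kwargs):
    """Count the rows csv.reader yields for `text`, splitting on ESC."""
    reader = csv.reader(io.StringIO(text), delimiter="\x1b", **reader_kwargs)
    return sum(1 for _ in reader)


default_rows = count_rows(SAMPLE)                          # quotes are interpreted
literal_rows = count_rows(SAMPLE, quoting=csv.QUOTE_NONE)  # quotes are plain text
print(default_rows, literal_rows)
```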


#### Check data types and column names

In [44]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1985554 entries, 0 to 1985553
Data columns (total 4 columns):
id                  int64
3username           object
3posted_datetime    object
3comments           object
dtypes: int64(1), object(3)
memory usage: 580.3 MB


#### Rename columns

In [45]:
df.columns = ['id', 'username', 'posted_datetime', 'comments']

In [46]:
df.head()

| | id | username | posted_datetime | comments |
|---|---|---|---|---|
| 0 | 14414747 | 3wichcraft | 32017-05-25 01:38:41 | 3i am curious where you get the quote from for... |
| 1 | 1652097 | 3brandnewlow | 32010-09-01 07:01:24 | 3Very fair. Duly noted. |
| 2 | 2980343 | 3darrenkopp | 32011-09-10 00:26:41 | 3We're pretty good at it now. A=Republicans, B... |
| 3 | 5573111 | 3SG- | 32013-04-18 21:10:58 | 3OVH also has a datacenter in Montreal. |
| 4 | 17088135 | 3namibj | 32018-05-17 02:06:42 | 3Blockchain? No, seriously, just a block-orien... |


In [47]:
# Verify that every cell in the affected columns begins with a '3' before editing directly
for col in ['username', 'posted_datetime', 'comments']:
    assert df[col].str.startswith('3').all()

#### Drop the '3' left over from the delimiter at the front of every cell in the columns 'username', 'posted_datetime', and 'comments'.
 - For every cell in the column, keep the substring from the second character to the end of the cell

In [48]:
df.loc[:, 'username'] = df.loc[:, 'username'].str[1:] 

In [49]:
df.loc[:, 'posted_datetime'] = df.loc[:, 'posted_datetime'].str[1:] 

In [50]:
df.loc[:, 'comments'] = df.loc[:, 'comments'].str[1:] 
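
The check-then-strip steps above can be folded into one helper so that no column is trimmed unless every one of its cells really carries the marker. This is a sketch under assumptions: `strip_leading_marker` is a hypothetical name, and the two-row frame is made up to mirror the notebook's columns.

```python
import pandas as pd


def strip_leading_marker(frame, columns, marker="3"):
    """Return a copy of `frame` with `marker` removed from the start of each cell.

    Hypothetical helper; raises ValueError if any cell lacks the marker, so a
    legitimate leading character is never removed by accident.
    """
    cleaned = frame.copy()
    for col in columns:
        starts = cleaned[col].str.startswith(marker, na=False)
        if not starts.all():
            raise ValueError(f"column {col!r} has cells without the marker")
        cleaned[col] = cleaned[col].str[len(marker):]
    return cleaned


# Tiny made-up frame shaped like the notebook's data.
demo = pd.DataFrame({
    "id": [17, 33],
    "username": ["3pg", "3spez"],
    "comments": ["3made-up comment", "33. a comment that starts with a digit"],
})
cleaned_demo = strip_leading_marker(demo, ["username", "comments"])
```

Only the first character is removed, so a comment that legitimately begins with `3` after the marker keeps it.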

#### Check for nulls

In [51]:
df.isnull().sum()

id                 0
username           0
posted_datetime    0
comments           0
dtype: int64

#### Have another look, should be clean

In [52]:
df.head(10)

| | id | username | posted_datetime | comments |
|---|---|---|---|---|
| 0 | 14414747 | wichcraft | 2017-05-25 01:38:41 | i am curious where you get the quote from for ... |
| 1 | 1652097 | brandnewlow | 2010-09-01 07:01:24 | Very fair. Duly noted. |
| 2 | 2980343 | darrenkopp | 2011-09-10 00:26:41 | We're pretty good at it now. A=Republicans, B=... |
| 3 | 5573111 | SG- | 2013-04-18 21:10:58 | OVH also has a datacenter in Montreal. |
| 4 | 17088135 | namibj | 2018-05-17 02:06:42 | Blockchain? No, seriously, just a block-orient... |
| 5 | 3213234 | fleitz | 2011-11-08 22:46:21 | And what about the short term holders of bank ... |
| 6 | 16127705 | AlexCoventry | 2018-01-11 20:49:57 | Fraud is not the only issue. OP needs to asses... |
| 7 | 12966210 | yitchelle | 2016-11-16 11:53:36 | I got toilet, so I drew a hole in the ground. ... |
| 8 | 9341951 | chatmasta | 2015-04-08 16:33:45 | This sounds fascinating. Do you have a link to... |
| 9 | 12771255 | BlytheSchuma | 2016-10-23 00:30:19 | Wow, you somehow made this about you again. |


#### Let's replace the index Pandas created by default with the posted_datetime

#### Save as clean TAB delimited file named 'tab_delim_clean.csv'

In [53]:
df.to_csv('tab_delim_clean.csv', sep='\t')

#### Save as an actual comma delimited file named 'comma_delim_clean.csv'

In [None]:
df.to_csv('comma_delim_clean.csv', sep=',')
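
Question 2 is about escaping: comments contain commas, so a comma-delimited file only works if such fields are quoted. Python's `csv` writer handles this with its default `QUOTE_MINIMAL` mode, which wraps any field containing the delimiter, a quote, or a line break in double quotes and doubles embedded quotes. A small sketch on made-up rows (not taken from the data file) shows the escaping and confirms that reading the text back recovers the original fields.

```python
import csv
import io

# Made-up rows in the notebook's column order; the comments contain commas and quotes.
rows = [
    ["id", "username", "posted_datetime", "comments"],
    ["17", "pg", "2006-10-09 19:52:45", "commas, inside, a comment"],
    ["33", "spez", "2006-10-10 15:50:40", 'He said "winnar", twice'],
]

buffer = io.StringIO(newline="")
csv.writer(buffer, quoting=csv.QUOTE_MINIMAL).writerows(rows)
csv_text = buffer.getvalue()

# Reading the text back recovers the original fields, commas and quotes included.
round_trip = list(csv.reader(io.StringIO(csv_text, newline="")))
print(csv_text)
```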

In [55]:
df.head(500).to_csv('comma_delim_trunc.csv', sep=',')

#### Load 'comma_delim_clean.csv' with default separator ','

In [4]:
df2 = pd.read_csv('comma_delim_clean.csv', index_col='posted_datetime')

In [5]:
df2.head()

| posted_datetime | Unnamed: 0 | id | username | comments |
|---|---|---|---|---|
| 2017-05-25 01:38:41 | 0 | 14414747 | wichcraft | i am curious where you get the quote from for ... |
| 2010-09-01 07:01:24 | 1 | 1652097 | brandnewlow | Very fair. Duly noted. |
| 2011-09-10 00:26:41 | 2 | 2980343 | darrenkopp | We're pretty good at it now. A=Republicans, B=... |
| 2013-04-18 21:10:58 | 3 | 5573111 | SG- | OVH also has a datacenter in Montreal. |
| 2018-05-17 02:06:42 | 4 | 17088135 | namibj | Blockchain? No, seriously, just a block-orient... |


#### The automatically created index column remains as 'Unnamed: 0'. I will drop it here, but should go back and fix its initial creation.

In [6]:
df2.drop('Unnamed: 0', axis=1, inplace=True)
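
The 'Unnamed: 0' column appears because `to_csv` writes the DataFrame's row labels as an unnamed first column by default. A minimal sketch on a made-up two-row frame shows that passing `index=False` when saving avoids creating the column in the first place, so nothing needs to be dropped after reloading.

```python
import io

import pandas as pd

# Made-up frame with the notebook's four columns.
demo = pd.DataFrame({
    "id": [17, 33],
    "username": ["pg", "spez"],
    "posted_datetime": ["2006-10-09 19:52:45", "2006-10-10 15:50:40"],
    "comments": ["first, made-up comment", "second made-up comment"],
})

# Default index=True: the row labels become an unnamed first column on disk,
# which read_csv later reports as 'Unnamed: 0'.
with_index = pd.read_csv(io.StringIO(demo.to_csv()))

# index=False writes only the real columns.
without_index = pd.read_csv(io.StringIO(demo.to_csv(index=False)))
```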

In [7]:
df2 = df2.sort_index(ascending=True)

In [8]:
df2.head(25)

| posted_datetime | id | username | comments |
|---|---|---|---|
| 2006-10-09 19:52:45 | 17 | pg | Is there anywhere to eat on Sandhill Road? |
| 2006-10-10 02:18:22 | 22 | pg | It's kind of funny that Sevin Rosen is giving ... |
| 2006-10-10 15:50:40 | 33 | spez | winnar winnar chicken dinnar! |
| 2006-10-10 15:53:53 | 34 | pg | what do you mean? this story's still not #1 |
| 2006-10-10 22:46:08 | 41 | starklysnarky | it's interesting how a simple set of features ... |
| 2007-02-19 04:49:59 | 79 | matt | Launch party circuit, eh? |
| 2007-02-19 04:51:22 | 80 | pg | yet another advantage of being in the Bay Area |
| 2007-02-20 06:19:53 | 141 | timg | news.ycombinator should let us redesign arc. |
| 2007-02-20 06:26:31 | 143 | Dauntless | 3. You can't delete your comment if you posted... |
| 2007-02-20 06:26:37 | 144 | Dauntless | 3. You can't delete your comment if you posted... |


In [37]:
df2.index

Index(['2006-10-09 19:52:45', '2006-10-10 02:18:22', '2006-10-10 15:50:40',
       '2006-10-10 15:53:53', '2006-10-10 22:46:08', '2007-02-19 04:49:59',
       '2007-02-19 04:51:22', '2007-02-20 06:19:53', '2007-02-20 06:26:31',
       '2007-02-20 06:26:37',
       ...
       '2018-07-30 08:59:32', '2018-07-30 09:01:21', '2018-07-30 09:02:16',
       '2018-07-30 09:03:49', '2018-07-30 09:05:39', '2018-07-30 09:06:47',
       '2018-07-30 09:10:26', '2018-07-30 09:11:43', '2018-07-30 09:20:28',
       '2018-07-30 09:22:17'],
      dtype='object', name='posted_datetime', length=2000000)

In [29]:
df2.index =  pd.to_datetime(df2.index)
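
The string-to-datetime conversion can also happen while the file is being read, which avoids sorting an index of plain strings first. A sketch on made-up CSV text, assuming the same column names as the notebook: `parse_dates` names the column to convert and `index_col` makes it the index.

```python
import io

import pandas as pd

# Made-up CSV text with the notebook's column names, deliberately out of order.
csv_text = (
    "id,username,posted_datetime,comments\n"
    "33,spez,2006-10-10 15:50:40,made-up comment\n"
    "17,pg,2006-10-09 19:52:45,another made-up comment\n"
)

frame = pd.read_csv(
    io.StringIO(csv_text),
    index_col="posted_datetime",
    parse_dates=["posted_datetime"],  # convert the column before it becomes the index
).sort_index()
```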

In [39]:
df2.index

DatetimeIndex(['2006-10-09 19:52:45', '2006-10-10 02:18:22',
               '2006-10-10 15:50:40', '2006-10-10 15:53:53',
               '2006-10-10 22:46:08', '2007-02-19 04:49:59',
               '2007-02-19 04:51:22', '2007-02-20 06:19:53',
               '2007-02-20 06:26:31', '2007-02-20 06:26:37',
               ...
               '2018-07-30 08:59:32', '2018-07-30 09:01:21',
               '2018-07-30 09:02:16', '2018-07-30 09:03:49',
               '2018-07-30 09:05:39', '2018-07-30 09:06:47',
               '2018-07-30 09:10:26', '2018-07-30 09:11:43',
               '2018-07-30 09:20:28', '2018-07-30 09:22:17'],
              dtype='datetime64[ns]', name='posted_datetime', length=2000000, freq=None)

In [54]:
%timeit df2.loc[:, 'comments'].str.len().max()

987 ms ± 52.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [55]:
%timeit df2.loc[:, 'comments'].map(len).max()

960 ms ± 8.77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [104]:
df2.groupby('username')

ValueError: Cannot index with multidimensional key

---
## SQL
---

#### List the top-10 most prolific posters

In [83]:
import csv_to_sqlite as ts

In [56]:
import sqlite3

In [57]:
ls

Data_test_2018_08_01.ipynb  data_test_2m.esc copy
comma_delim_clean.csv       output.csv
comma_delim_trunc.csv       tab_delim_clean.csv
data_test_2m.esc            tab_delim_file.csv


In [92]:
ts -f 'comma_delim_trunc.csv'

SyntaxError: invalid syntax (<ipython-input-92-3eae98b82433>, line 1)

In [79]:
f=open('comma_delim_trunc.csv','r') # open the csv data file
next(f, None) # skip the header row
reader = csv.reader(f)

sql = sqlite3.connect('example.db')
cur = sql.cursor()

cur.execute('''CREATE TABLE IF NOT EXISTS utterances
            (username)''') # create the table if it doesn't already exist

for row in reader:
    cur.execute("INSERT INTO utterances VALUES (?, ?, ?, ?)", row)

f.close()
sql.commit()
sql.close()

ProgrammingError: Incorrect number of bindings supplied. The current statement uses 4, and there are 5 supplied.
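
The error above comes from a mismatch in three places: the `CREATE TABLE` statement declares one column, the `INSERT` has four placeholders, and each CSV row carries five values (the saved pandas index plus the four data columns). A minimal sketch of a consistent version follows. It is an assumption-laden illustration rather than the notebook's code: `load_comments` is a hypothetical helper, the table is named `user_comments` after the question, and the sample rows are made up.

```python
import csv
import io
import sqlite3


def load_comments(conn, csv_file):
    """Load id, username, posted_datetime, comments rows into user_comments.

    Assumes each CSV row has a leading pandas index value followed by the four
    data columns, as in the files written earlier in this notebook.
    Returns the number of rows inserted.
    """
    conn.execute(
        """CREATE TABLE IF NOT EXISTS user_comments
           (id INTEGER, username TEXT, posted_datetime TEXT, comments TEXT)"""
    )
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row
    rows = [row[1:] for row in reader]  # drop the leading index value
    conn.executemany("INSERT INTO user_comments VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Made-up sample in the same layout as comma_delim_trunc.csv.
sample = io.StringIO(
    ",id,username,posted_datetime,comments\n"
    '0,17,pg,2006-10-09 19:52:45,"made-up comment, with a comma"\n'
    "1,33,spez,2006-10-10 15:50:40,another made-up comment\n"
)
conn = sqlite3.connect(":memory:")
inserted = load_comments(conn, sample)
```

For the real file, `csv_file` would be something like `open('comma_delim_trunc.csv', newline='')` and the connection would point at a file such as `example.db`.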

In [26]:
top_posters = df2.loc[:, 'username'].value_counts().head(10)

In [27]:
top_posters

tptacek        6491
jacquesm       4401
eru            4212
pjmlp          3405
pg             2805
wmf            2774
jrockway       2618
Tichy          2588
gaius          2572
icebraining    2361
Name: username, dtype: int64
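
The `value_counts()` result above answers Question 3 in pandas, while the question asks for a query. A sketch of the equivalent SQL, run through `sqlite3` against a made-up three-row table: the table and column names come from the question, the tie-break on `username` is an assumption, and `LIMIT` is the SQLite, MySQL, and PostgreSQL spelling (SQL Server would use `TOP 10`).

```python
import sqlite3

TOP_POSTERS_SQL = """
SELECT username, COUNT(*) AS comment_count
FROM user_comments
GROUP BY username
ORDER BY comment_count DESC, username
LIMIT 10
"""

# Made-up data, only to show the query running end to end.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_comments (id INTEGER, username TEXT, posted_datetime TEXT, comments TEXT)"
)
conn.executemany(
    "INSERT INTO user_comments VALUES (?, ?, ?, ?)",
    [
        (1, "pg", "2006-10-09 19:52:45", "made-up"),
        (2, "pg", "2006-10-10 02:18:22", "made-up"),
        (3, "spez", "2006-10-10 15:50:40", "made-up"),
    ],
)
top_posters_sql = conn.execute(TOP_POSTERS_SQL).fetchall()
```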

In [24]:
df2['is_vip'] = df2['username'].isin(top_posters.index)

In [25]:
df2.head()

| posted_datetime | id | username | comments | is_vip |
|---|---|---|---|---|
| 2006-10-09 19:52:45 | 17 | pg | Is there anywhere to eat on Sandhill Road? | True |
| 2006-10-10 02:18:22 | 22 | pg | It's kind of funny that Sevin Rosen is giving ... | True |
| 2006-10-10 15:50:40 | 33 | spez | winnar winnar chicken dinnar! | False |
| 2006-10-10 15:53:53 | 34 | pg | what do you mean? this story's still not #1 | True |
| 2006-10-10 22:46:08 | 41 | starklysnarky | it's interesting how a simple set of features ... | False |
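
The pandas flag above marks comment rows, whereas Question 4 asks for an `UPDATE` on the `users` table. A sketch of one way to write it, checked with `sqlite3` on made-up rows: setting `is_vip` to `NULL` for everyone outside the top posters is one reading of "only the top-10 posters have a value", and `:top_n` is a hypothetical parameter that lets the demo use 1 where the question calls for 10.

```python
import sqlite3

# `top_n` is bound as a parameter so the demo can use a small value.
MARK_VIPS_SQL = """
UPDATE users
SET is_vip = CASE
    WHEN username IN (
        SELECT username
        FROM user_comments
        GROUP BY username
        ORDER BY COUNT(*) DESC, username
        LIMIT :top_n
    ) THEN 1
    ELSE NULL  -- everyone else ends up with no value
END
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, name TEXT, is_vip INTEGER, joined_datetime TEXT)")
conn.execute(
    "CREATE TABLE user_comments (id INTEGER, username TEXT, posted_datetime TEXT, comments TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [("pg", "made-up name", None, "2006-10-09 00:00:00"),
     ("spez", "made-up name", 1, "2006-10-10 00:00:00")],
)
conn.executemany(
    "INSERT INTO user_comments VALUES (?, ?, ?, ?)",
    [(1, "pg", "2006-10-09 19:52:45", "made-up"),
     (2, "pg", "2006-10-10 02:18:22", "made-up"),
     (3, "spez", "2006-10-10 15:50:40", "made-up")],
)
conn.execute(MARK_VIPS_SQL, {"top_n": 1})
vip_flags = conn.execute("SELECT username, is_vip FROM users ORDER BY username").fetchall()
```

In the demo, `spez` starts with a stale `is_vip` value and ends with `NULL`, which is the "only the top posters" part of the requirement.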


#### Calculate the percentage of comments made in the first 30 days of a user's account

In [9]:
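-- Percentage of comments posted within 30 days of the author's join date (DATEADD is SQL Server syntax)
SELECT 100.0 * SUM(CASE WHEN c.posted_datetime BETWEEN u.joined_datetime
                         AND DATEADD(d, 30, u.joined_datetime) THEN 1 ELSE 0 END) / COUNT(*) AS pct_first_30_days
FROM user_comments c
JOIN users u ON u.username = c.username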

#insert into fruit (color_id, name)
# select 11, 'banana'
# where exists 
# (select * from fruit join colors on fruit.color_id = colors.id
# where colors.id = 11 and colors.owner = 6);
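
The 30-day percentage for Question 5 can be checked end to end on small made-up tables. The sketch below is a SQLite variant run through `sqlite3`: it uses SQLite's `datetime(..., '+30 days')` modifier where SQL Server would use `DATEADD`, relies on the timestamps being ISO-formatted text so that string comparison matches chronological order, and counts only comments whose author exists in `users`. All rows are illustrative.

```python
import sqlite3

PCT_FIRST_30_DAYS_SQL = """
SELECT 100.0 * SUM(
           CASE WHEN c.posted_datetime >= u.joined_datetime
                 AND c.posted_datetime < datetime(u.joined_datetime, '+30 days')
                THEN 1 ELSE 0 END
       ) / COUNT(*) AS pct_first_30_days
FROM user_comments AS c
JOIN users AS u ON u.username = c.username
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, name TEXT, is_vip INTEGER, joined_datetime TEXT)")
conn.execute(
    "CREATE TABLE user_comments (id INTEGER, username TEXT, posted_datetime TEXT, comments TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [("pg", "made-up name", None, "2006-10-01 00:00:00"),
     ("spez", "made-up name", None, "2006-10-10 00:00:00")],
)
conn.executemany(
    "INSERT INTO user_comments VALUES (?, ?, ?, ?)",
    [(1, "pg", "2006-10-09 19:52:45", "inside the window"),
     (2, "pg", "2006-12-01 08:00:00", "outside the window"),
     (3, "spez", "2006-10-10 15:50:40", "inside the window"),
     (4, "spez", "2007-01-01 00:00:00", "outside the window")],
)
pct_first_30_days = conn.execute(PCT_FIRST_30_DAYS_SQL).fetchone()[0]
```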