# Pipeline Prototype

## Data Wrangling

In [22]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Files in the current directory:

In [3]:
import os
print(os.listdir())

['.git', '.gitignore', '.ipynb_checkpoints', 'pipeline.ipynb', 'README.md', 'sample.tsv', 'schema.txt']


Import the schema file:

In [11]:
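# sep=r'\s+' splits on any whitespace; it replaces delim_whitespace=True, which is deprecated in recent pandas
schema = pd.read_csv('schema.txt', sep=r'\s+', header=None)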
schema

Unnamed: 0,0,1,2,3,4,5
0,1,id,bigint(20),UNSIGNED,No,
1,2,userId,bigint(20),UNSIGNED,No,
2,3,createdAt,timestamp,No,0000-00-00,00:00:00
3,4,text,text,utf8_unicode_ci,No,
4,5,longitude,float,Yes,,
5,6,latitude,float,Yes,,
6,7,placeId,varchar(25),utf8_general_ci,Yes,
7,8,inReplyTo,bigint(20),UNSIGNED,Yes,
8,9,source,int(10),UNSIGNED,No,
9,10,truncated,bit(1),No,,


This gives us the position, size and format for each attribute of the data.
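
Because the schema lines do not all have the same number of whitespace-separated fields, the trailing columns of the table above are misaligned. A minimal sketch of one way to keep only the three leading fields, which appear on every line, is shown here. The `schema_text` excerpt is reconstructed from the table above rather than copied from `schema.txt`, and the column names `position`, `name` and `type` are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical excerpt of schema.txt, reconstructed from the table above.
schema_text = """\
1 id bigint(20) UNSIGNED No
2 userId bigint(20) UNSIGNED No
3 createdAt timestamp No 0000-00-00 00:00:00
4 text text utf8_unicode_ci No
5 longitude float Yes
"""

# Split each non-empty line on whitespace and keep the first three tokens.
rows = [line.split()[:3] for line in schema_text.splitlines() if line.strip()]

# Column names are illustrative, not taken from the schema file itself.
schema_head = pd.DataFrame(rows, columns=["position", "name", "type"])
schema_head["position"] = schema_head["position"].astype(int)

print(schema_head)
```

Parsing with `str.split` rather than `pd.read_csv` sidesteps the ragged rows entirely, at the cost of ignoring the nullability and default-value fields.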

Importing the data sample as provided doesn't work: there seem to be inconsistencies in the data formatting that pandas can't resolve while reading it. So we need to preprocess the sample with Trifacta to get something pandas can read. We run the following Trifacta commands on the sample data:

`splitrows col: column1 on: '\n'
split col: column1 on: '\t' limit: 20
settype col: column7 type: 'Float'
settype col: column6 type: 'Float'
drop col: column11
`

The last line of the script drops the 11th column of the data. Inspecting this column with Trifacta shows that it is empty. Analysing the raw `.tsv` file shows why: when the `truncated` feature is 0, it is not present in the row, so Trifacta can't tell the `truncated` and `source` features apart and merges them together. We therefore drop the empty 11th column, mainly because importing it as is with pandas causes problems with the columns that follow.
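
A quick way to look for this kind of inconsistency without Trifacta is to count the tab-separated fields in each row. The sketch below is only an illustration: `count_fields` is a hypothetical helper, and the toy rows are made up to mimic a trailing flag that is sometimes absent; they are not taken from `sample.tsv`.

```python
from collections import Counter

def count_fields(lines, sep="\t"):
    """Map 'number of fields in a row' to 'number of rows with that count'."""
    return Counter(len(line.rstrip("\n").split(sep)) for line in lines)

# Toy rows, made up for illustration (not real data from the sample).
toy_rows = [
    "1\t42\thello\t7\t1\n",  # 5 fields: trailing flag present
    "2\t43\tworld\t7\n",     # 4 fields: trailing flag missing
    "3\t44\tagain\t7\t1\n",  # 5 fields: trailing flag present
]

counts = count_fields(toy_rows)
print(counts)
```

On the real file, the same helper could be applied to an open file handle. More than one distinct field count would suggest that some rows omit a value, although text fields that themselves contain tabs or newlines would also produce this pattern.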

Now we are able to import it with pandas:

In [84]:
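import csv  # only used by the commented-out read below
# Direct tab-separated read, left commented out:
#df = pd.read_csv('sample.txt', sep='\t', header=None, quoting=csv.QUOTE_NONE)
# Read the Trifacta export: '\N' is treated as a missing value and column4 is parsed as a datetime
df = pd.read_csv('sample.csv', na_values='\\N', parse_dates=['column4'])
df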

Unnamed: 0,column2,column3,column4,column5,column6,column7,column8,column9,column10,column12,column13,column14,column15,column16,column17,column18,column19,column20,column21,column22
0,776522983837954049,735449229028675584,2016-09-15 20:48:01,se lo dici tu... https://t.co/x7Qm1VHBKL,,,51c0e6b24c64e54e,,1,46.0027,8.96044,Twitter for iPhone,http://twitter.com/#!/download/iphone,plvtone filiae.,hazel_chb,146,110,28621,Earleen.,
1,776523000636203010,2741685639,2016-09-15 20:48:05,https://t.co/noYrTnqmg9,,,4e7c21fd2af027c6,,1,46.8131,8.22414,Twitter for iPhone,http://twitter.com/#!/download/iphone,samara,letisieg,755,2037,3771,Suisse,
2,776523045200691200,435239151,2016-09-15 20:48:15,@BesacTof @Leonid_CCCP Tu dois t'engager en si...,,,12eb9b254faf37a3,7.765221e+17,5,47.2010,5.94082,Twitter for Android,http://twitter.com/download/android,lebrübrü❤,lebrubru,811,595,30191,Fontain,
3,776523058404290560,503244217,2016-09-15 20:48:18,@Mno0or_Abyat اشوف مظاهرات على قانون العمل الج...,,,30bcd7f767b4041e,7.765216e+17,1,45.8011,6.16552,Twitter for iPhone,http://twitter.com/#!/download/iphone,عبدالله القنيص,bingnais,28433,417,12262,Shargeyah,
4,776523058504925185,452805259,2016-09-15 20:48:18,Greek night #geneve (@ Emilios in Genève) http...,6.14414,46.1966,c3a6437e1b1a726d,,3,46.2048,6.14319,foursquare,http://foursquare.com,Alkan Şenli,Alkanoli,204,172,3390,İstanbul/Burgazada,
5,776523071025012736,16416746,2016-09-15 20:48:21,@gregorypons #BusinessMontresVision https://t....,,,c3a6437e1b1a726d,7.765209e+17,18777,46.2048,6.14319,Twitter Web Client,http://twitter.com,Gregory PONS,gregorypons,2398,305,14917,Geneva + watchmaking planet,
6,776523092768219137,735449229028675584,2016-09-15 20:48:27,dillo https://t.co/hScjeZbi4c,,,51c0e6b24c64e54e,,1,46.0027,8.96044,Twitter for iPhone,http://twitter.com/#!/download/iphone,plvtone filiae.,hazel_chb,146,110,28621,Earleen.,
7,776523105007177728,2442105406,2016-09-15 20:48:29,Miii le voci nere.. Che meraviglia.. #XF10,,,3f0b5e0668b2fd3c,,5,45.8865,9.64878,Twitter for Android,http://twitter.com/download/android,Roberta Perani,robertabg72,971,1753,8181,,
8,776523126729474048,101489921,2016-09-15 20:48:35,@Manu_Aka_Manny Sorry,,,9b05f50adb666c0e,7.765209e+17,1,45.8327,8.77107,Twitter for iPhone,http://twitter.com/#!/download/iphone,Martina Ferraiuolo,martiferraiuolo,530,308,9610,"Varese, Italy",
9,776523134409244673,715791830294650880,2016-09-15 20:48:37,Je veut ca https://t.co/NZpSScxQ70,,,3f056fa44e682001,,5,47.7170,7.34480,Twitter for Android,http://twitter.com/download/android,analslut,doggyboynude,1206,2503,815,france,
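
The first nine data columns line up with the first nine schema attributes, since `column11` (`truncated`) was dropped. A minimal sketch of renaming them accordingly follows. The mapping is an assumption read off the two tables above, `rename_schema_columns` is a hypothetical helper, and the toy frame stands in for `df`.

```python
import pandas as pd

# Assumed mapping from Trifacta's generic names to the schema attribute names.
# column11 (truncated) was dropped by the Trifacta script, so it is not listed.
SCHEMA_NAMES = {
    "column2": "id",
    "column3": "userId",
    "column4": "createdAt",
    "column5": "text",
    "column6": "longitude",
    "column7": "latitude",
    "column8": "placeId",
    "column9": "inReplyTo",
    "column10": "source",
}

def rename_schema_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with schema names; columns not in the mapping keep their names."""
    return frame.rename(columns=SCHEMA_NAMES)

# Toy frame standing in for df, with made-up values.
toy = pd.DataFrame({"column2": [1], "column3": [2], "column12": [46.0]})
renamed = rename_schema_columns(toy)
print(list(renamed.columns))
```

Columns from `column12` onward are not described by the schema shown here, so the sketch leaves them unchanged.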
