This module is called ETL because it extracts reviews from a semi-structured data source, applies some minor transformations, and then saves the data in a structured format (a pandas DataFrame).

At the same time, I'll be doing some minor EDA to get a sense of the data characteristics. Since we're in an unsupervised setting, there's no need for a train-test split at this juncture.

In [21]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import sklearn

# Extracting Reviews from source

In [37]:
cwd_path = os.getcwd()
print("My current directory is : " + cwd_path)
path = os.path.dirname(cwd_path)
data_path = path+"/data"

My current directory is : /home/ivo/Trabalho/Interviews/Siemens/task_1
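
As a side note, the same folder can be derived with `pathlib` instead of string concatenation. This is only a sketch: `build_data_path` is a hypothetical helper, and it assumes the `data` folder sits next to the current working directory, as in the cell above.

```python
from pathlib import Path


def build_data_path(cwd: Path) -> Path:
    """Return the `data` folder located next to the given working directory."""
    # Hypothetical helper: mirrors os.path.dirname(cwd) + "/data" from the cell above.
    return cwd.parent / "data"


data_path = build_data_path(Path.cwd())
print(data_path)
```

`Path` objects can be passed directly to `df.to_csv`, for example `df.to_csv(data_path / "preprocessed.csv")`.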


In [13]:
# Each line of the file is one review, loaded into a single "Reviews" column.
df = pd.read_csv("../reviews.txt", delimiter="\\n", header=None,
                 engine="python", names=["Reviews"])
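
If the newline-delimiter trick is not accepted by a given pandas version, a plain-Python alternative is to read the file line by line. This is a sketch under assumptions, not what the cell above does: `read_reviews` is a hypothetical helper, blank lines are assumed to be skippable, and an in-memory sample stands in for `reviews.txt`.

```python
import io

import pandas as pd


def read_reviews(handle) -> pd.DataFrame:
    """Build a one-column DataFrame with one review per non-blank line."""
    # Strip only the line terminator so the review text itself is untouched.
    reviews = [line.rstrip("\r\n") for line in handle if line.strip()]
    return pd.DataFrame({"Reviews": reviews})


# Made-up sample text used purely for illustration.
sample_file = io.StringIO("Great cables.\nOK\n\nWorks, but the clamps are stiff.\n")
df_sample = read_reviews(sample_file)
print(df_sample)
```

With a real file, the same function would be called as `read_reviews(open("../reviews.txt", encoding="utf-8"))`, where the encoding is also an assumption.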

In [14]:
df.head()

Unnamed: 0,Reviews
0,I needed a set of jumper cables for my new car...
1,"These long cables work fine for my truck, but ..."
2,Can't comment much on these since they have no...
3,I absolutley love Amazon!!! For the price of ...
4,I purchased the 12' feet long cable set and th...


In [28]:
df.loc[3].Reviews

'I absolutley love Amazon!!!  For the price of a set of cheap Booster/Jumper Cables in a brick and morter store, you can buy extra long and heavy duty jumpers!  First off, don\'t be the person that not only needs to ask a kind passer-by for a "jump" but also if they have jumper cables.  It\'s MUCH easier to get a jump start if you have your own cables.Next lets talk about sizing.  Having the longest cable possible is a major plus if your car is parked up against something like a pole or wall, or even parked on a one way street.  The "booster car" (the car w/o a dead battery) can pull in close enough to use the cables without having to manuver into some akward position.  Or better yet, you won\'t have to push your vehicle into a position to be jumped.  If your diving a normal sized car they can even pull in behind you to jump you!  Or if their vehicle is the shorter of the two, they could pull in front.  Now how about gauge?  For those who aren\'t electricians or engineers, as the numbe

Judging by these first rows, the reviews seem to be about cables, jumper cables to be more specific.

In [24]:
df["length"]=df.Reviews.apply(lambda x: len(x.split()))
# Count the words in each review, then check the counts for odd occurrences.
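
The same word count can also be written with the pandas string accessor, which avoids the Python-level `lambda`. A minimal sketch on a made-up sample (the `sample` frame is illustrative, not part of the dataset):

```python
import pandas as pd

# Made-up reviews, including one with irregular whitespace.
sample = pd.DataFrame(
    {"Reviews": ["OK", "Good cables for the price", "  spaced   out  words "]}
)

# str.split() with no arguments splits on any run of whitespace,
# and str.len() then counts the resulting tokens per row.
sample["length"] = sample["Reviews"].str.split().str.len()
print(sample)
```

One difference worth knowing: missing values yield `NaN` here, whereas the `apply` version would raise on a non-string value.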

In [25]:
df["length"].describe()

count    20453.000000
mean        85.371388
std         99.553678
min          1.000000
25%         31.000000
50%         52.000000
75%         99.000000
max       2239.000000
Name: length, dtype: float64

We clearly have some extreme values: the shortest review has just 1 word, while the longest has 2239.
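
One possible way to flag extreme lengths more systematically is Tukey's IQR rule, which is a swapped-in technique and not what this notebook uses. With the quartiles reported above (31 and 99), the upper fence would be 99 + 1.5 × 68 = 201 words. In the sketch below, `iqr_fences` and the multiplier `k` are hypothetical names, and the sample lengths are made up.

```python
import pandas as pd


def iqr_fences(lengths: pd.Series, k: float = 1.5) -> tuple:
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = lengths.quantile(0.25)
    q3 = lengths.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up word counts with one very long review.
sample_lengths = pd.Series([1, 31, 40, 52, 60, 99, 120, 2239])
low, high = iqr_fences(sample_lengths)

# Keep only the values that fall outside the fences.
flagged = sample_lengths[(sample_lengths < low) | (sample_lengths > high)]
print(flagged)
```

Because word counts cannot be negative, the lower fence rarely flags anything; very short reviews such as one-word ones need a separate look.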

In [29]:
df[df["length"]==1]

Unnamed: 0,Reviews,length
654,OK,1
806,CHEEP,1
10022,Good.,1


These seem to be valid reviews.

In [30]:
df[df["length"]>1000]

Unnamed: 0,Reviews,length
2469,WHEN TO USE EPOXY CEMENTWhen you need a strong...,2239
3050,Another reviewer mentioned this charger puts o...,1280
3922,Although usually sold for automobile bulb sock...,1088
4420,People need to understand about motor oil to m...,1799
5420,"There is a large family of Goop adhesives, ""Al...",1106
8349,Consider this... The combine cost of your tow...,1153
12059,"First, this is a good product if you want to t...",2049
14987,I know the seats in my wife and I's cars have ...,1046
14999,In an attempt to keep my new leather dash from...,1037
15332,"Edit: I had the jack for about 3 weeks, used i...",1167


It seems that we in fact have a miscellany of products being reviewed, from motor oil to cement...
It would have made more sense to gather the reviews for different products separately.

In [33]:
df[(df["length"]>31) & (df["length"]< 99)]
# Different products also show up among reviews of typical length
# (between the 25th and 75th percentiles), not only among the extreme ones.

Unnamed: 0,Reviews,length
1,"These long cables work fine for my truck, but ...",51
4,I purchased the 12' feet long cable set and th...,77
6,bought these for my k2500 suburban plenty of l...,58
8,The Coleman Cable 08665 12-Feet Heavy-Duty Tru...,38
9,"I have an old car, Its bound to need these som...",61
...,...,...
20421,I recently purchased 2 luxury Italian leather ...,57
20429,To me these LED lights are artistic and cool. ...,72
20432,These LED strip lights are well made and work ...,67
20440,"We've never had a product like this before, bu...",63


# Exporting data

Save the data in an easy-to-load format (CSV).

In [38]:
df.to_csv(data_path+"/preprocessed.csv")
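
When loading the file back, note that `to_csv` writes the index as an unnamed first column by default. A small round-trip sketch, using an in-memory buffer and a made-up frame instead of `preprocessed.csv`:

```python
import io

import pandas as pd

# Made-up frame with the same columns as the exported one.
sample = pd.DataFrame({"Reviews": ["OK", "Good cables"], "length": [1, 2]})

# Write to an in-memory buffer (stand-in for the CSV file on disk).
buffer = io.StringIO()
sample.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the saved index instead of creating an "Unnamed: 0" column.
reloaded = pd.read_csv(buffer, index_col=0)
print(reloaded)
```

Alternatively, passing `index=False` to `to_csv` drops the index from the file altogether, which is fine here if the default integer index carries no meaning.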