## Data analysis and cleaning

In [1]:
import pandas as pd

In [2]:
reviews = pd.read_csv('./archive/Reviews.csv')

In [3]:
reviews.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


### The dataset has null values in the **ProfileName** and **Summary** columns. These columns, together with **Id**, **HelpfulnessNumerator** and **HelpfulnessDenominator**, can be dropped because the jobs do not need them

In [4]:
reviews.isnull().sum()

Id                         0
ProductId                  0
UserId                     0
ProfileName               16
HelpfulnessNumerator       0
HelpfulnessDenominator     0
Score                      0
Time                       0
Summary                   27
Text                       0
dtype: int64
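
Because the nulls sit only in columns that the jobs do not use, selecting the required columns removes them as a side effect, with no `dropna()` call. A minimal sketch on a toy frame follows; the rows are invented for illustration and only the column names mirror the dataset.

```python
import pandas as pd

# Toy data, invented for illustration; only the column names mirror Reviews.csv.
toy = pd.DataFrame({
    "ProfileName": ["anna", None, "carl"],
    "Summary": [None, "ok", "good"],
    "Time": [1303862400, 1346976000, 1219017600],
    "Text": ["first review", "second review", "third review"],
})

# Nulls appear only in the columns that are going to be dropped.
null_counts = toy.isnull().sum()

# Keep just the columns a job needs and count the nulls that are left.
kept = toy[["Time", "Text"]]
remaining_nulls = int(kept.isnull().sum().sum())
print(null_counts.to_dict(), remaining_nulls)
```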

In [5]:
reviews_job1 = reviews.copy(deep=True)
reviews_jobs = reviews.copy(deep=True)

### Job1 needs the **Time** and **Text** columns

In [6]:
reviews_job1.drop(['Id', 'ProductId', 'UserId', 'Score', 'ProfileName', 'HelpfulnessNumerator', 'HelpfulnessDenominator', 'Summary'], inplace = True, axis = 1)

In [7]:
reviews_job1.head()

Unnamed: 0,Time,Text
0,1303862400,I have bought several of the Vitality canned d...
1,1346976000,Product arrived labeled as Jumbo Salted Peanut...
2,1219017600,This is a confection that has been around a fe...
3,1307923200,If you are looking for the secret ingredient i...
4,1350777600,Great taffy at a great price. There was a wid...


#### As shown below, the text contains punctuation and HTML tags, which will have to be removed during the mapping phase

In [8]:
reviews_job1.Text[10]

"I don't know if it's the cactus or the tequila or just the unique combination of ingredients, but the flavour of this hot sauce makes it one of a kind!  We picked up a bottle once on a trip we were on and brought it back home with us and were totally blown away!  When we realized that we simply couldn't find it anywhere in our city we were bummed.<br /><br />Now, because of the magic of the internet, we have a case of the sauce and are ecstatic because of it.<br /><br />If you love hot sauce..I mean really love hot sauce, but don't want a sauce that tastelessly burns your throat, grab a bottle of Tequila Picante Gourmet de Inclan.  Just realize that once you taste it, you will never want to use any other sauce.<br /><br />Thank you for the personal, incredible service!"

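The cleanup described above could look roughly like the sketch below. The helper name `clean_text` is hypothetical, and lowercasing and splitting on whitespace are assumptions about what the mapper does; the notebook only states that HTML tags and punctuation must be removed.

```python
import re
import string

# Matches HTML tags such as <br /> that appear inside the review text.
TAG_RE = re.compile(r"<[^>]+>")
# Maps every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean_text(text):
    """Hypothetical helper: strip HTML tags and punctuation, then tokenize."""
    no_tags = TAG_RE.sub(" ", text)
    no_punct = no_tags.translate(PUNCT_TABLE)
    # Lowercasing and whitespace splitting are assumptions, not from the notebook.
    return no_punct.lower().split()

sample = "we were bummed.<br /><br />Now, because of the magic of the internet"
print(clean_text(sample))
```

Replacing punctuation with a space rather than deleting it keeps words on either side of a tag or a run of dots apart, at the cost of splitting contractions such as `don't` into `don` and `t`.
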
##### Function that creates files of different sizes to use in the jobs

In [9]:
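# Scale factors relative to the original number of rows (half, same, double).
dataset_sizes = [0.5, 1, 2]

def sample_all_sizes(dataset, job):
    """Write one tab-separated sample of `dataset` per scale factor in dataset_sizes."""
    for size in dataset_sizes:
        n_rows = round(dataset.shape[0] * size)
        # replace=True is required for size 2, which asks for more rows than exist.
        sampled_df = dataset.sample(n=n_rows, random_state=42, replace=True)
        # 0.5 -> "05", 1 -> "1", 2 -> "2"; the ./dataset directory must already exist.
        filename = './dataset/reviews_{}_dim_{}.csv'.format(job, str(size).replace(".", ""))
        # No header and no index, so the jobs read plain tab-separated rows.
        sampled_df.to_csv(filename, header=None, index=None, sep='\t', mode='w')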

In [10]:
sample_all_sizes(reviews_job1, 'job1')
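
The sampling step can be tried on a small frame to see why it draws with replacement. The sketch below uses an invented four-row frame and the same `sample` arguments as the function above, and shows that requesting twice the rows only works when `replace=True`.

```python
import pandas as pd

# Toy frame, invented for illustration.
toy = pd.DataFrame({"Time": [1, 2, 3, 4], "Text": ["a", "b", "c", "d"]})

sample_lengths = {}
for size in [0.5, 1, 2]:
    n_rows = round(toy.shape[0] * size)
    sampled = toy.sample(n=n_rows, random_state=42, replace=True)
    sample_lengths[size] = len(sampled)
print(sample_lengths)

# Without replacement, asking for more rows than exist raises ValueError.
try:
    toy.sample(n=8, random_state=42, replace=False)
    oversample_failed = False
except ValueError:
    oversample_failed = True
print(oversample_failed)
```

Sampling with replacement also means that the 0.5 and 1 files may repeat some rows and omit others, so the size-1 file is generally not an exact copy of the original data.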

In [11]:
reviews_job1_05 = pd.read_csv('./dataset/reviews_job1_dim_05.csv', sep="\t", header=None)

In [12]:
reviews_job1_05

Unnamed: 0,0,1
0,1338336000,"I have 5-7 dogs at any given time, sometimes f..."
1,1297123200,I already liked regular Stash Earl Grey and so...
2,1199577600,eight oclock makes great coffee and with balan...
3,1309910400,I bought these for my kids but find myself eat...
4,1325203200,This is very good and just like the gourmet on...
...,...,...
284222,1336348800,"Taste great, though I think I prefer their che..."
284223,1293580800,Absolutely delicious. Exactly what I was looki...
284224,1326844800,To help me drink more water I used flavor crys...
284225,1250467200,We get Coffee People Oranic X-bold through sub...


### Job2 and Job3 need the **UserId**, **ProductId** and **Score** columns

In [14]:
reviews_jobs.drop(['Id', 'ProfileName', 'HelpfulnessNumerator', 'HelpfulnessDenominator', 'Time', 'Summary', 'Text'], inplace = True, axis = 1)

In [15]:
reviews_jobs.head()

Unnamed: 0,ProductId,UserId,Score
0,B001E4KFG0,A3SGXH7AUHU8GW,5
1,B00813GRG4,A1D87F6ZCVE5NK,1
2,B000LQOCH0,ABXLMWJIXXAIN,4
3,B000UA0QIQ,A395BORC6FGVXV,2
4,B006K2ZZ7K,A1UQRSCLF8GW1T,5


In [16]:
sample_all_sizes(reviews_jobs, 'jobs')

In [2]:
reviews_jobs_05 = pd.read_csv('./dataset/reviews_jobs_dim_05.csv', sep="\t", header=None)

In [3]:
reviews_jobs_05

Unnamed: 0,0,1,2
0,B003M63C0E,A27L3LYLHCQZYG,5
1,B000CQIDHY,A1AES697PC2IW5,5
2,B001EQ55MM,A1Q99N7YEJ6CZJ,5
3,B001PICX42,A3RJVINZDBOUNE,5
4,B000B6MV9Q,A2LN6GJQI1S9EW,4
...,...,...,...
284222,B006BXUVPY,A9JLE9BISQFUB,4
284223,B0002TVW24,AH7B7I1EQ0386,5
284224,B001E52VLG,A3QU9R1IZY03ZR,1
284225,B0029XLH4Y,A1JEY42M785KI7,5
