# Data Analysis - Part#1

### Objective: 
Apply data analysis concepts.

### Author: 
Nathalia Contreras

### Dataset used: 
[Amazon sample dataset](https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Gift_Card_v1_00.tsv.gz)
### Dataset size: 
11.6 MB

### Topic: Reading Data in Python

**a) Read Data**

In [1]:
#Defining path where data file is
path = './amazon_reviews_us_Gift_Card_v1_00.tsv.gz'

In [2]:
#Library to read zipped file
import gzip

In [3]:
#Open data
f = gzip.open(path, 'rt') #rt means read as text

In [4]:
#Read first line of dataset
#use next() or readline()
columns=f.readline() 

#First line corresponds to the HEADERS

Notice that everything is separated by '\t' because it's a TSV file. Plus there is '\n'

In [5]:
#Remove spaces and '\t'
columns = columns.strip().split('\t') #Read headers, remove spaces, and separate data where '\t'
columns

['marketplace',
 'customer_id',
 'review_id',
 'product_id',
 'product_parent',
 'product_title',
 'product_category',
 'star_rating',
 'helpful_votes',
 'total_votes',
 'vine',
 'verified_purchase',
 'review_headline',
 'review_body',
 'review_date']

**b) Organising Data**

In [6]:
#Empty list. This will store all data
lines = []
#Iterate line by line in dataset
for line in f:
    fields = line.strip().split('\t')
    lines.append(fields)

In [7]:
#Empty list stores data lines and headers
dataset = []
#Match lines[] to headers
for line in lines:
# Convert to key-value pairs
    d = dict(zip(columns, line))
    # Convert strings to integers for some fields:
    dataset.append(d)

In [8]:
#Print dataset first entry
dataset[0]

{'marketplace': 'US',
 'customer_id': '24371595',
 'review_id': 'R27ZP1F1CD0C3Y',
 'product_id': 'B004LLIL5A',
 'product_parent': '346014806',
 'product_title': 'Amazon eGift Card - Celebrate',
 'product_category': 'Gift Card',
 'star_rating': '5',
 'helpful_votes': '0',
 'total_votes': '0',
 'vine': 'N',
 'verified_purchase': 'Y',
 'review_headline': 'Five Stars',
 'review_body': 'Great birthday gift for a young adult.',
 'review_date': '2015-08-31'}

**c) Another way to read data and organise it**

Using CSV library

In [9]:
import csv
import gzip

full_lines = csv.reader(gzip.open(path, 'rt'), delimiter = '\t')
dataset = []

#Append lines to dataset
first = True
for line in full_lines:
    #First line is Header
    if first:
        header = line
        first = False
    else:
        d = dict(zip(header, line))
        dataset.append(d)

In [10]:
#1rst row
dataset[0] 

{'marketplace': 'US',
 'customer_id': '24371595',
 'review_id': 'R27ZP1F1CD0C3Y',
 'product_id': 'B004LLIL5A',
 'product_parent': '346014806',
 'product_title': 'Amazon eGift Card - Celebrate',
 'product_category': 'Gift Card',
 'star_rating': '5',
 'helpful_votes': '0',
 'total_votes': '0',
 'vine': 'N',
 'verified_purchase': 'Y',
 'review_headline': 'Five Stars',
 'review_body': 'Great birthday gift for a young adult.',
 'review_date': '2015-08-31'}

**d) Pre-processing Data**

In [11]:
#Define empty dataset2 to store filtered dataset
dataset2 = []
#Store columns in variable 'col'
col = columns
#Store lines (all data) in variable 'ln'
ln = lines
#Append lines to dataset
for line in ln:
    d1 = dict(zip(col, line)) #Convert to key-value pair
    #Capture filtered fields
    d2 = {}
    for field in ['marketplace', 'product_title', 'star_rating','review_date']: #Define required fields
        d2[field] = d1[field]
    dataset2.append(d2)

In [12]:
#Check the length of new dataset
len(dataset2)

149086

In [13]:
#Print 5th entry of new dataset
dataset2[5]

{'marketplace': 'US',
 'product_title': 'Amazon Gift Card - Print - Happy Birthday (Birds)',
 'star_rating': '5',
 'review_date': '2015-08-31'}

**e) Basic Statistics**

In [14]:
#Define empty dataset3 to store filtered dataset
dataset3 = []
col = columns
ln = lines

for line in ln:
    d1 = dict(zip(col, line)) #Convert to key-value pair
    # Converting Data Types:     
    d1['star_rating'] = int(d['star_rating'])
    d1['helpful_votes'] = int(d['helpful_votes'])
    d1['total_votes'] = int(d['total_votes'])
    
    d2 = {}
    for field in ['customer_id', 'product_id','star_rating','helpful_votes',
                  'total_votes','verified_purchase','review_date']: 
        d2[field] = d1[field]
    dataset3.append(d2)

**e.1) Calculate total number of reviews**

In [15]:
#Number of Gift Card Reviews in dataset
number_reviews = len(dataset3)
number_reviews

149086

**e.2) Calculate Average star_rating of all reviews**

In [16]:
#Average star_rating of all reviews
avg = 0
for x in dataset3:
    avg += x['star_rating'] #star_rating for each review
avg /= number_reviews #Total number of reviews
avg

5.0

**e.3) Another way to calculate the average star_rating of all reviews**

In [17]:
#Another way to calculate the average star_rating
ratings = [x['star_rating'] for x in dataset3]
#Calculate average
avg = sum(ratings)/len(ratings)
avg

5.0

**e.4) Calculate unique star_rating values of all reviews**

In [18]:
#Unique star_rating of reviews
numLowStar = set() #This function avoids element repetition + order in asc
for x in dataset3:
    if x['star_rating'] <= 5:
        numLowStar.add(x['star_rating'])
#Check 
len(numLowStar)

1

The length=1 means that there is only 1 unique star rating.

In [19]:
#Printing
numLowStar

{5}

The above value means there isn't a star_rating less than 5 stars. 
In other words, all reviewers have given 5 stars, hence that is the only value.

**e.5) Calculate how many unique products were reviewed**

In [20]:
#Unique product_id
numprod = set() #This function avoids element repetition + order in asc
for x in dataset3:
    if x['product_id']:
        numprod.add(x['product_id'])
#Check 
print(len(numprod))

1780


The above value means there are 1780 unique products in this dataset that have been reviewed.

**e.6) Another way to count total ratings**

In [21]:
#Another way to count ratings
from collections import defaultdict
#Define dictionary of integer values
count_ratings = defaultdict(int)
#Iterate in dataset
for x in dataset3:
    count_ratings[x['star_rating']] += 1

#Printing
count_ratings

defaultdict(int, {5: 149086})

The previous counting confirms that only 5-stars reviews exist.

**e.7) Calculate total number of verified and non-verified purchases**

In [22]:
#Count verified and non-verified purchases

#from collections import defaultdict #no need to import again if imported before

#Define dictionary of integer values
count_purchases = defaultdict(int)
#Iterate in dataset
for x in dataset3:
    count_purchases[x['verified_purchase']] += 1

#Printing
count_purchases

defaultdict(int, {'Y': 136042, 'N': 13044})

**e.8) Calculate the most popular product / the most reviewed**

In [23]:
#Calculate the most popular product / the most reviewed

#Define dictionary of integer values
count_popular = defaultdict(int)
#Iterate in dataset
for x in dataset3:
    count_popular[x['product_id']] += 1

#Transform to list
total_count = [(count_popular[p],p) for p in count_popular]
#Sort
total_count.sort()
#Printing 5 most popular products
total_count[-5:]

#This can complement e.5 section

[(3440, 'BT00CTOUNS'),
 (4315, 'B00IX1I3G6'),
 (5043, 'BT00DDVMVQ'),
 (6087, 'B00A48G0D4'),
 (28879, 'B004LLIKVU')]

**Author: [Nathalia Contreras](https://github.com/ncontrerass)**