### Reading the Reviews and ProductInfo of all JSON files into a single pandas dataframe

In [2]:
#Importing libraries
import glob
import json
import pandas as pd

In [3]:
#Reading the sample file
with open('Sample_Data/B000E7T7JO.json', 'rb') as file:
    File = json.load(file)

In [92]:
#Printing the File that was read in the previous command
print(File['Reviews'][:2])
print(File['ProductInfo'])

[{u'Author': u'Renee A.', u'ReviewID': u'R1XV15PBPT4F3N', u'Overall': u'4.0', u'Content': u'the one i was sent was the silver not blue version but overall, i liked the phone and it was mainly due to the simplicity of the phone.pros:speakerphonespacing of keysinside screenreceptionbattery lifecons:outside display of the contact name and the word "calling" next to it.... it seems redundant to have that.can only have individual ringtones if on the phone itself and not on SIM card.simple functions difficult to find without the help of manual.the cons are minor if you don\'t mind a few gliches but i recommend this phone for those who want a simple phone with quality over quantity.', u'Title': u'Good and simple phone.', u'Date': u'May 2, 2006'}, {u'Author': u'Fortran guy', u'ReviewID': u'R2NZE1TJ8U4OM6', u'Overall': u'3.0', u'Content': u'I bought this phone mainly to use in Italy for extended visits and plan to get an Italian SIM card. The phone itself works well (I tend to like Motorola pho

As the output shows, the JSON data file has two top-level keys, 'Reviews' and 'ProductInfo':

1. Reviews - The value is an array of dictionaries, one per review, written by different authors about the same product.
2. ProductInfo - The value is a dictionary describing the product that the reviews were written for.

The following code reads these two keys into separate dataframes and concatenates them into a final dataset that can be used for further text analytics.


In [13]:
df_r = pd.DataFrame(File['Reviews']) #Reading the Reviews (a list of dicts) into a dataframe
df_r.head()

Unnamed: 0,Author,Content,Date,Overall,ReviewID,Title
0,Renee A.,the one i was sent was the silver not blue ver...,"May 2, 2006",4.0,R1XV15PBPT4F3N,Good and simple phone.
1,Fortran guy,I bought this phone mainly to use in Italy for...,"December 27, 2008",3.0,R2NZE1TJ8U4OM6,OK phone --- but it is *NOT* unlocked
2,R. Truderung,Easy to uselong battery lifegood features if l...,"June 24, 2006",5.0,R2QVOAB2V519JB,simple & easy
3,A. Wiedlea,"The phone was not unlocked as advertised, and ...","November 22, 2008",1.0,R1W6775KPGPV8U,Not Unlocked
4,"Lost Soul ""TKW""",This is an excellent basic phone. No complain...,"April 25, 2008",5.0,RLE7RNJVY2LB2,Motorola V-190 Cell Phone


In [93]:
product_info = File['ProductInfo'] #Reading the ProductInfo into a Dictionary variable
print(type(product_info))
product_info

<type 'dict'>


{u'Features': u"This unlocked cell phone is compatible with GSM carriers like AT&T; and T-Mobile. Not all carrier features may be supported. It will not work with CDMA carriers like Verizon Wireless, Alltel and Sprint.\nQuad-band GSM cell phone compatible with 850/900/1800/1900 frequencies and GPRS capabilities\nDual LCD screens; integrated speakerphone; supports polyphonic and MP3 ringtones; SMS text messaging; organizer tools; USB connectivity\nUp to 11.25 hours of talk time and 700 hours of standby\nWhat's in the Box: Handset, battery, battery door, travel charger, user manual",
 u'ImgURL': u'http://ecx.images-amazon.com/images/I/412XG8654YL._SY300_.jpg',
 u'Name': u'Motorola V190 Unlocked Phone Quad-Band GSM, MP3, and SpeakerPhone--U.S. Version with Warranty (Black)',
 u'Price': u'$149.99',
 u'ProductID': u'B000E7T7JO'}
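
The concatenation in the next cell passes `index=df_r.index` when it turns `product_info` into a dataframe. A minimal sketch of why that argument matters, using a hypothetical two-key stand-in for the real dictionary: a dict whose values are all scalars cannot become a dataframe without an index, and with one, each scalar is repeated for every index label.

```python
import pandas as pd

# Hypothetical stand-in for File['ProductInfo']: a flat dict of scalar values.
product_info = {"ProductID": "B000E7T7JO", "Price": "$149.99"}

# Without an index, pandas cannot tell how many rows to build and raises.
try:
    pd.DataFrame(product_info)
    needs_index = False
except ValueError:
    needs_index = True

# With an index, every scalar is repeated once per index label.
repeated = pd.DataFrame(product_info, index=range(3))
print(needs_index)
print(repeated)
```

This repetition is what lets the product fields line up row by row with the reviews.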

In [94]:
new_df = pd.concat([df_r, pd.DataFrame(product_info, index=df_r.index)], axis=1) #Concatenating Reviews and ProductInfo
new_df.head()

Unnamed: 0,Author,Content,Date,Overall,ReviewID,Title,Features,ImgURL,Name,Price,ProductID
0,Renee A.,the one i was sent was the silver not blue ver...,"May 2, 2006",4.0,R1XV15PBPT4F3N,Good and simple phone.,This unlocked cell phone is compatible with GS...,http://ecx.images-amazon.com/images/I/412XG865...,"Motorola V190 Unlocked Phone Quad-Band GSM, MP...",$149.99,B000E7T7JO
1,Fortran guy,I bought this phone mainly to use in Italy for...,"December 27, 2008",3.0,R2NZE1TJ8U4OM6,OK phone --- but it is *NOT* unlocked,This unlocked cell phone is compatible with GS...,http://ecx.images-amazon.com/images/I/412XG865...,"Motorola V190 Unlocked Phone Quad-Band GSM, MP...",$149.99,B000E7T7JO
2,R. Truderung,Easy to uselong battery lifegood features if l...,"June 24, 2006",5.0,R2QVOAB2V519JB,simple & easy,This unlocked cell phone is compatible with GS...,http://ecx.images-amazon.com/images/I/412XG865...,"Motorola V190 Unlocked Phone Quad-Band GSM, MP...",$149.99,B000E7T7JO
3,A. Wiedlea,"The phone was not unlocked as advertised, and ...","November 22, 2008",1.0,R1W6775KPGPV8U,Not Unlocked,This unlocked cell phone is compatible with GS...,http://ecx.images-amazon.com/images/I/412XG865...,"Motorola V190 Unlocked Phone Quad-Band GSM, MP...",$149.99,B000E7T7JO
4,"Lost Soul ""TKW""",This is an excellent basic phone. No complain...,"April 25, 2008",5.0,RLE7RNJVY2LB2,Motorola V-190 Cell Phone,This unlocked cell phone is compatible with GS...,http://ecx.images-amazon.com/images/I/412XG865...,"Motorola V190 Unlocked Phone Quad-Band GSM, MP...",$149.99,B000E7T7JO
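
As a possible alternative to building the two dataframes separately, `pd.json_normalize` (available in pandas 1.0 and later) can flatten the reviews and attach selected product fields in one call. The sketch below is not the method used in this notebook; it assumes a hypothetical miniature file with the same two top-level keys.

```python
import pandas as pd

# Hypothetical miniature of one product file, with the same two top-level keys.
sample = {
    "Reviews": [
        {"ReviewID": "R1", "Overall": "4.0", "Title": "Good and simple phone."},
        {"ReviewID": "R2", "Overall": "3.0", "Title": "OK phone"},
    ],
    "ProductInfo": {"ProductID": "B000E7T7JO", "Price": "$149.99"},
}

# One row per review; the listed ProductInfo fields are copied onto each row.
flat = pd.json_normalize(
    sample,
    record_path="Reviews",
    meta=[["ProductInfo", "ProductID"], ["ProductInfo", "Price"]],
)

# Meta columns are named "ProductInfo.<field>"; strip the prefix to match new_df.
flat.columns = [c.split(".")[-1] for c in flat.columns]
print(flat)
```

The explicit `pd.concat` approach used in the cell above copies every ProductInfo field without listing them, which is convenient when the set of fields is not known in advance.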


#### Now that Reviews and ProductInfo are combined in one dataframe, we apply the same logic to every JSON file in the directory to build one large dataframe that can be used for text analytics.

### Creating a dataframe from all the files in the Sample_Data folder

In [9]:
read_files = glob.glob("Sample_Data/*.json")
frames = []  #One combined dataframe per JSON file
for path in read_files:
    with open(path, 'rb') as x:
        JSONFILE = json.load(x)
    review_df = pd.DataFrame(JSONFILE['Reviews'])
    product = JSONFILE['ProductInfo']
    #Repeating the product fields on every review row and joining them side by side
    frames.append(pd.concat([review_df, pd.DataFrame(product, index=review_df.index)], axis=1))
#Concatenating once at the end avoids re-copying Final_df on every iteration
Final_df = pd.concat(frames) if frames else pd.DataFrame()

Final_df.head()

Unnamed: 0,Author,Content,Date,Overall,ReviewID,Title,Features,ImgURL,Name,Price,ProductID
0,Renee A.,the one i was sent was the silver not blue ver...,"May 2, 2006",4.0,R1XV15PBPT4F3N,Good and simple phone.,This unlocked cell phone is compatible with GS...,http://ecx.images-amazon.com/images/I/412XG865...,"Motorola V190 Unlocked Phone Quad-Band GSM, MP...",$149.99,B000E7T7JO
1,Fortran guy,I bought this phone mainly to use in Italy for...,"December 27, 2008",3.0,R2NZE1TJ8U4OM6,OK phone --- but it is *NOT* unlocked,This unlocked cell phone is compatible with GS...,http://ecx.images-amazon.com/images/I/412XG865...,"Motorola V190 Unlocked Phone Quad-Band GSM, MP...",$149.99,B000E7T7JO
2,R. Truderung,Easy to uselong battery lifegood features if l...,"June 24, 2006",5.0,R2QVOAB2V519JB,simple & easy,This unlocked cell phone is compatible with GS...,http://ecx.images-amazon.com/images/I/412XG865...,"Motorola V190 Unlocked Phone Quad-Band GSM, MP...",$149.99,B000E7T7JO
3,A. Wiedlea,"The phone was not unlocked as advertised, and ...","November 22, 2008",1.0,R1W6775KPGPV8U,Not Unlocked,This unlocked cell phone is compatible with GS...,http://ecx.images-amazon.com/images/I/412XG865...,"Motorola V190 Unlocked Phone Quad-Band GSM, MP...",$149.99,B000E7T7JO
4,"Lost Soul ""TKW""",This is an excellent basic phone. No complain...,"April 25, 2008",5.0,RLE7RNJVY2LB2,Motorola V-190 Cell Phone,This unlocked cell phone is compatible with GS...,http://ecx.images-amazon.com/images/I/412XG865...,"Motorola V190 Unlocked Phone Quad-Band GSM, MP...",$149.99,B000E7T7JO


### Creating and saving a dataframe from all the files in the mobilephone folder (complete data, ~130 MB)

In [84]:
read_files1 = glob.glob("/media/mukund/New Volume/Datasets/AmazonReviews/mobilephone/*.json")
frames1 = []  #One combined dataframe per JSON file
for path in read_files1:
    with open(path, 'rb') as x:
        JSONFILE1 = json.load(x)
    review_df1 = pd.DataFrame(JSONFILE1['Reviews'])
    product1 = JSONFILE1['ProductInfo']
    #Repeating the product fields on every review row and joining them side by side
    frames1.append(pd.concat([review_df1, pd.DataFrame(product1, index=review_df1.index)], axis=1))
#Concatenating once at the end avoids re-copying Final_df1 on every iteration
Final_df1 = pd.concat(frames1) if frames1 else pd.DataFrame()

Final_df1.head()

Unnamed: 0,Author,Content,Date,Features,ImgURL,Name,Overall,Price,ProductID,ReviewID,Title
0,Dustin,Product came exactly as described and would re...,"March 10, 2014",,,,5.0,,1466736038,RYYNWQWW6LAC1,Great
1,Lancerman,I am very pleased with the phone we received. ...,"March 27, 2014",,,,5.0,,1466736038,R2G160TW2JWGD8,Excellent phone.
2,Maranda,The Samsung Galazy S3 is one of the best phone...,"April 3, 2014",,,,5.0,,1466736038,R3P9IS2JNG68K2,As described and a great phone
0,Jackie C,"The television was a refurbished one, and for ...","April 13, 2014",,,,4.0,,8987029395,R3LDJA7HU2Q0FS,nice television for the price
1,Sonja Trokey,Delivery was very prompt. The picture of this...,"April 1, 2014",,,,4.0,,8987029395,RAQB0MR9LGA2G,Service and product quality very good.


In [95]:
Final_df1.count()

Author       185718
Content      185993
Date         186015
Features     172753
ImgURL       175331
Name         175331
Overall      186016
Price        175257
ProductID    186016
ReviewID     186016
Title        186016
dtype: int64
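
The counts above differ between columns because `count()` tallies only non-null entries. With 186016 rows in total (the ReviewID count), the Features column, for example, is missing in 186016 - 172753 = 13263 rows. A small sketch on a hypothetical toy frame shows how the non-null and missing counts relate:

```python
import pandas as pd

# Hypothetical toy frame; None marks a value that was missing in the source.
toy = pd.DataFrame({
    "Author": ["Dustin", None, "Maranda"],
    "Price": [None, None, "$149.99"],
    "ReviewID": ["R1", "R2", "R3"],
})

non_null = toy.count()       # non-missing entries per column
missing = toy.isna().sum()   # missing entries per column
print(missing)
```

For every column the two numbers add up to the number of rows, so `isna().sum()` is a direct way to see which fields need handling before text analytics.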

In [96]:
#Resetting the index because it restarts at 0 for every file; the old values are kept in an 'index' column
Final_df1 = Final_df1.reset_index()
Final_df1.head(10)

Unnamed: 0,index,Author,Content,Date,Features,ImgURL,Name,Overall,Price,ProductID,ReviewID,Title
0,0,Dustin,Product came exactly as described and would re...,"March 10, 2014",,,,5.0,,1466736038,RYYNWQWW6LAC1,Great
1,1,Lancerman,I am very pleased with the phone we received. ...,"March 27, 2014",,,,5.0,,1466736038,R2G160TW2JWGD8,Excellent phone.
2,2,Maranda,The Samsung Galazy S3 is one of the best phone...,"April 3, 2014",,,,5.0,,1466736038,R3P9IS2JNG68K2,As described and a great phone
3,0,Jackie C,"The television was a refurbished one, and for ...","April 13, 2014",,,,4.0,,8987029395,R3LDJA7HU2Q0FS,nice television for the price
4,1,Sonja Trokey,Delivery was very prompt. The picture of this...,"April 1, 2014",,,,4.0,,8987029395,RAQB0MR9LGA2G,Service and product quality very good.
5,0,RrB,This phone is the biggest piece of junk I have...,"December 28, 2011",Predictive text input anticipates what you're ...,http://ecx.images-amazon.com/images/I/412FfxHo...,HTC Dash / S620 / S621 (Excalibur) Black Windo...,1.0,Unavailable,9043435856,R2SNXGY9Z3IRMR,Don't buy it!
6,1,Peter,"From the Moment I opened the box, I fell head ...","November 15, 2010",Predictive text input anticipates what you're ...,http://ecx.images-amazon.com/images/I/412FfxHo...,HTC Dash / S620 / S621 (Excalibur) Black Windo...,5.0,Unavailable,9043435856,RI8F7YL5KZ7TB,Great Buy if you don't want to be Suckered int...
7,2,Phone does not work!,I purchased this phone thinking it was brand n...,"May 1, 2012",Predictive text input anticipates what you're ...,http://ecx.images-amazon.com/images/I/412FfxHo...,HTC Dash / S620 / S621 (Excalibur) Black Windo...,1.0,Unavailable,9043435856,RBJOWI21FRE27,Phone was refurbished
8,3,dm301,"This was a great phone - everything I wanted, ...","December 2, 2010",Predictive text input anticipates what you're ...,http://ecx.images-amazon.com/images/I/412FfxHo...,HTC Dash / S620 / S621 (Excalibur) Black Windo...,3.0,Unavailable,9043435856,R1R4VRGK2UTSYU,No touch screen
9,4,Julie Crowe,I chose a 5 star rate for this seller because ...,"October 11, 2013",Predictive text input anticipates what you're ...,http://ecx.images-amazon.com/images/I/412FfxHo...,HTC Dash / S620 / S621 (Excalibur) Black Windo...,5.0,Unavailable,9043435856,R1W6Y4BC52PWSI,This company is GREAT to do business with
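
The extra `index` column in the output above holds the old per-file row numbers. If those are not needed, `reset_index(drop=True)` discards them instead. A minimal sketch with two hypothetical per-file frames:

```python
import pandas as pd

# Two hypothetical per-file frames, each with its own 0-based index.
a = pd.DataFrame({"ReviewID": ["R1", "R2"]})
b = pd.DataFrame({"ReviewID": ["R3"]})

stacked = pd.concat([a, b])                # index repeats: 0, 1, 0
kept = stacked.reset_index()               # old index kept as an 'index' column
dropped = stacked.reset_index(drop=True)   # old index discarded
print(list(kept.columns), list(dropped.columns))
```

Passing `ignore_index=True` to `pd.concat` would give the same sequential index as the `drop=True` variant in a single step.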


### Saving the dataframe for reuse

In [97]:
# SAVING THE DATAFRAME
Final_df1.to_pickle('full_dataset/full_dataset')
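
To reuse the saved dataset in a later session, it can be loaded back with `pd.read_pickle`. The sketch below round-trips a hypothetical small frame through a temporary directory; note that `to_pickle` does not create missing directories, so the `full_dataset/` folder in the cell above is assumed to exist already.

```python
import os
import tempfile

import pandas as pd

# Hypothetical small frame standing in for Final_df1.
frame = pd.DataFrame({"ReviewID": ["R1", "R2"], "Overall": ["5.0", "4.0"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "full_dataset")  # no file extension, as in the cell above
    frame.to_pickle(path)
    restored = pd.read_pickle(path)

print(restored.equals(frame))
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.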